Analysis of Food Pricing and Demographics on American HealthΒΆ

Kellen Parks-GardnerΒΆ

Table of ContentsΒΆ

  1. Introduction
  2. Data Collection
  3. Data Cleaning
  4. Data Visualization
  5. Data Analysis
  6. Insight

IntroductionΒΆ

Obesity is characterized as having a Body Mass Index (BMI), which is measured as weight divided by heigh squared, of at least 30. Having a high BMI like this, and being considered obese is considered a health risk, increasing chances for diabetes, heart disease, stroke, cancer, mental health issues and fertility issues. Another measure used by BMI is being considered overweight, which ranges from BMI values of 25 to 29.9, and being overweight carries similar health concerns, but at reduced rates. Unfortunately, in the United States, the combined rates of Americans that are obese or overweight have increased to encompass 70% of adult Americans, with obesity rates having tripled in the last 60 years.

To see why these rates of obesity and overweightness have increased, it would be beneficial to identify risk factors that can cause higher rates than normal. To do this, I will look at several factors that may play into these worrying rates. These include certain demographic variables, such as age, income, race, or education, as well as the price of food over time, and the rate of physical activity for American adults.

Data CollectionΒΆ

We will be looking at data from two datasets. The first one will be the Consumer Price Index (CPI) of food groups over time. The CPI of food is a price index, it looks at the average change in price over time, accounting for inflation. This dataset comes from the U.S. Department of Agriculture, and is all percent changes in CPI for different food groups from 1974 to 2022.

The second dataset comes from the U.S. Department of Health and Human Services. It is more complex than the other one, and is resultant from a survery taken across the country. The dataset has survery results from 2011 to 2022, taken from several different questions asked to several different demographic groups, as mentioned above in the introduction. Most data values are percent values of individuals that fall under criteria for each question.

InΒ [1]:
#good for looking at data in tables and standard for data science
import pandas as pd
import numpy as np

#these help visualize the data through graphs
import matplotlib.pyplot as plt
import seaborn as sns

#these are statistical frameworks that help analyze the data in advanced methods
import statsmodels.api
import statsmodels.formula.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

Step 1

Reading the data is simple, as both datasets had csv files, which are easily read into dataframe with a simple method.

InΒ [2]:
cpihist = pd.read_csv('./historicalcpi.csv')
cpihist.head()
Out[2]:
Consumer Price Index item Year Percent change
0 All food 1974 14.3
1 All food 1975 8.5
2 All food 1976 3.0
3 All food 1977 6.3
4 All food 1978 9.9
InΒ [3]:
nut_phys_etc = pd.read_csv('Nutrition__Physical_Activity__and_Obesity_-_Behavioral_Risk_Factor_Surveillance_System.csv')
nut_phys_etc.head()
Out[3]:
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value_Unit Data_Value_Type ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... NaN Value ... NaN PA PA1 Q047 VALUE 59 Race/Ethnicity Hispanic RACE RACEHIS
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... (13.444304, 144.793731) OWS OWS1 Q036 VALUE 66 Education High school graduate EDU EDUHSGRAD
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... NaN OWS OWS1 Q036 VALUE 59 Income $50,000 - $74,999 INC INC5075
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... NaN Value ... NaN OWS OWS1 Q037 VALUE 59 Income Data not reported INC INCNR
4 2015 2015 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who achieve at least 300 min... NaN Value ... NaN PA PA1 Q045 VALUE 59 Income Less than $15,000 INC INCLESS15

5 rows Γ— 33 columns

Data CleaningΒΆ

To be able to actually work with the data from the two datasets, we will first need to alter the dataframes into how we want them. Looking at the CPI of food dataset, it seems straightforward and we will not need to work much on it. The survey data is much more complex, and we will need to change this dataset a lot to fit our needs.

Step 1

To get an idea of what the specifics of the data is, we will look at all possible elements for each column in the table. We will do this so that we know what data we want to bother working with, and how we may need to change how some data is represented in the table.

InΒ [5]:
for c in nut_phys_etc.columns:
    print(f"Unique values in {c} : {nut_phys_etc[c].unique()}")
Unique values in YearStart : [2020 2014 2013 2015 2012 2011 2017 2016 2018 2019 2022 2021]
Unique values in YearEnd : [2020 2014 2013 2015 2012 2011 2017 2016 2018 2019 2022 2021]
Unique values in LocationAbbr : ['US' 'GU' 'WY' 'DC' 'PR' 'AL' 'RI' 'DE' 'NJ' 'WA' 'ME' 'MI' 'VA' 'CA'
 'UT' 'NY' 'MA' 'AR' 'IL' 'NH' 'NM' 'MD' 'MN' 'HI' 'LA' 'SD' 'TX' 'KY'
 'WV' 'CO' 'OK' 'MS' 'OR' 'WI' 'KS' 'FL' 'ID' 'AZ' 'VI' 'MT' 'GA' 'NC'
 'PA' 'ND' 'SC' 'NE' 'TN' 'MO' 'NV' 'IA' 'IN' 'OH' 'AK' 'VT' 'CT']
Unique values in LocationDesc : ['National' 'Guam' 'Wyoming' 'District of Columbia' 'Puerto Rico'
 'Alabama' 'Rhode Island' 'Delaware' 'New Jersey' 'Washington' 'Maine'
 'Michigan' 'Virginia' 'California' 'Utah' 'New York' 'Massachusetts'
 'Arkansas' 'Illinois' 'New Hampshire' 'New Mexico' 'Maryland' 'Minnesota'
 'Hawaii' 'Louisiana' 'South Dakota' 'Texas' 'Kentucky' 'West Virginia'
 'Colorado' 'Oklahoma' 'Mississippi' 'Oregon' 'Wisconsin' 'Kansas'
 'Florida' 'Idaho' 'Arizona' 'Virgin Islands' 'Montana' 'Georgia'
 'North Carolina' 'Pennsylvania' 'North Dakota' 'South Carolina'
 'Nebraska' 'Tennessee' 'Missouri' 'Nevada' 'Iowa' 'Indiana' 'Ohio'
 'Alaska' 'Vermont' 'Connecticut']
Unique values in Datasource : ['Behavioral Risk Factor Surveillance System' 'BRFSS']
Unique values in Class : ['Physical Activity' 'Obesity / Weight Status' 'Fruits and Vegetables']
Unique values in Topic : ['Physical Activity - Behavior' 'Obesity / Weight Status'
 'Fruits and Vegetables - Behavior']
Unique values in Question : ['Percent of adults who engage in no leisure-time physical activity'
 'Percent of adults aged 18 years and older who have obesity'
 'Percent of adults aged 18 years and older who have an overweight classification'
 'Percent of adults who achieve at least 300 minutes a week of moderate-intensity aerobic physical activity or 150 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)'
 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic physical activity and engage in muscle-strengthening activities on 2 or more days a week'
 'Percent of adults who achieve at least 150 minutes a week of moderate-intensity aerobic physical activity or 75 minutes a week of vigorous-intensity aerobic activity (or an equivalent combination)'
 'Percent of adults who engage in muscle-strengthening activities on 2 or more days a week'
 'Percent of adults who report consuming fruit less than one time daily'
 'Percent of adults who report consuming vegetables less than one time daily']
Unique values in Data_Value_Unit : [nan]
Unique values in Data_Value_Type : ['Value']
Unique values in Data_Value : [30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 30.5 40.2 35.4 32.3
 15.3  nan 33.7 32.1 19.8 35.1 36.1 27.9 42.  16.9 34.5 31.5 50.4 32.2
 29.5 21.7 34.2 21.5 29.8 40.1 18.8 13.7 33.  25.9 20.1 36.  29.1 17.5
 38.5 35.3 38.  27.5 44.3 28.3 32.5 35.9 24.2 28.5 25.  56.1 25.1 39.
 37.4 34.  26.1 33.2 46.1 23.6 22.6 26.  10.9 42.7 35.5 29.9  8.8 22.1
 28.1 21.1 35.8 24.  42.8 41.9 33.3 33.1 23.8 38.4 36.6 29.2 27.7 23.1
 13.1 10.6 24.6 20.  31.9 15.  23.2 65.8 39.5 30.4 27.2 26.5 29.7 32.8
 22.7 10.1 38.9 19.6 30.2 36.9 20.2 26.8 52.8 41.8 15.7 15.9 34.9 27.3
 23.  37.  34.1 26.9 36.4 32.  12.5 38.2 41.4 34.3 30.8 25.5 28.9 39.2
 28.4 30.7 20.7  9.2 30.9 19.1 25.4 38.7 30.  18.4  7.  38.6 14.1 14.5
 28.7  8.3 25.6 23.5  9.4 37.2 22.9 32.6 30.3 37.1  8.5 60.5 39.1 39.8
 37.7 25.3  4.5 35.7 37.6 39.4 17.  30.1 16.4 44.6 32.9 12.2 33.4 36.3
 21.9 25.8 55.4 10.8 24.4 20.9 39.6 20.4 33.5 17.1 36.2 38.3 21.6 12.9
 11.5 31.1 14.7 39.7  2.2 36.7 31.2 25.7 23.3 27.6 43.7 24.7 25.2 26.4
 49.5 39.9 51.1 33.8 46.2 33.6 17.8 34.6 41.2 58.9 13.5 16.5 31.  48.1
  9.7 52.7 40.9 48.4 29.6 33.9 22.  22.4 27.  31.4 34.8 27.1 14.6  4.4
 10.4 15.6 26.7 42.9 44.8 40.4 16.7 40.8 28.2 24.8 43.  29.4 17.4 31.7
 41.3 36.8 35.  18.7 14.8 16.3 32.4 41.5 41.1 22.8 44.5 43.9 22.3 20.3
 29.  54.1 12.8 15.8 18.6 11.8 21.2 45.9 34.7 37.8 28.6 28.  36.5 12.4
 15.4 31.8 40.7 42.1 16.   8.1 26.3 21.8 21.4 17.3 16.8 34.4 43.2 24.1
 40.  46.3 11.6 40.6 37.9 31.3 41.7 43.1 19.5 19.3 26.2  9.6 54.2 53.6
 45.7 14.2 27.8 43.4 42.3 35.6 17.2 14.9 24.9 37.3 21.  38.8 20.8 10.2
 12.1 23.9 18.1 41.  16.2 44.4 37.5 54.5 44.  18.9 16.1 13.3 24.3 46.9
 39.3 23.7  9.3 41.6 45.2 24.5 42.4 12.3 19.9 19.7 22.5 13.9 10.7 15.5
 17.7 51.  40.3 18.2 43.6 15.1 18.3 47.9 42.2 50.9  5.9 16.6  9.8  7.3
 45.5 44.1 19.2 49.6 48.  13.2 46.   6.9 51.3 19.  23.4 12.7 18.5  7.8
 15.2 13.  18.  10.3 22.2 11.2 59.4 12.6 59.7 19.4  8.2 56.4 60.4 43.3
 53.9 20.6 40.5 50.5 50.2 45.8 46.6 44.2 45.4 12.  54.4 20.5 47.5  7.7
 43.5 48.9 53.2 42.6 62.  50.6 53.4 42.5 13.8 48.6  9.9 54.8 14.4 11.9
 14.3  6.4  8.6  9.5 51.2 47.7 52.2 50.  21.3  2.6 45.6 13.6 11.7 13.4
 47.6 55.2 58.  10.5 45.3 46.8 11.1 17.6 54.  47.3 52.5 49.4 57.  44.7
 58.6  9.1 17.9  6.1 47.4 14.  48.2 47.   6.5  7.6 11.4 55.6 53.5 46.4
 49.  50.3 11.3  8.  46.7 52.9 51.8  4.6 45.1 47.1 50.7  2.5  8.4 44.9
  4.2 45.   8.7 50.1 47.2 60.8 51.9 51.5  3.6 43.8 53.  11.  50.8  6.7
 47.8 49.8 49.1 10.  48.3 54.7 49.7  9.   6.2  7.5 46.5  5.7 52.6 49.2
 48.7 52.4 55.7 49.9 54.9 56.9 56.  59.1 52.1 58.8 48.8 57.8 55.  59.5
 65.2 58.7 51.6 55.5 54.3 62.6  7.4 63.6 51.4 56.5 53.7 63.4 59.  60.1
 55.1 53.1 70.6 53.8 60.7 58.3 57.1 61.2 58.4 60.  59.6 59.3 56.8 55.9
 71.6 52.3 49.3 52.  61.1 67.1 64.1 64.2  3.9 58.1 66.8 54.6 56.6 67.2
 64.6 62.2 59.2 68.2 55.8 56.2  7.9 57.9  8.9 59.9 58.2 58.5 60.2 60.6
  1.7 69.7 57.7 59.8 63.5 66.4 57.4 61.3 57.5 61.5 56.7  2.3 56.3  7.1
 62.3  5.2 53.3  3.3 61.7  2.8 55.3 63.   6.6 61.9 51.7 65.1 60.9 62.4
 63.1 63.3  4.1 71.4 61.4 73.1 62.5  5.8 62.8 61.  71.5 57.2  4.   2.
 64.5 65.7 63.2  0.9  5.1 70.  64.7 62.1 63.7  5.6 61.6 65.6 61.8 64.3
 57.6 66.9 65.3 66.  57.3 63.9  5.4 64.4 72.3 68.3 69.1  4.3  6.8  3.2
 68.6 69.3 70.7 62.7 65.  68.9  5.   5.5 77.6 64.8 66.2  7.2  2.4 69.
  6.  70.4  5.3 68.  70.9 66.3 60.3  3.8 66.1 67.8 71.2 64.9 64.  63.8
 66.5 65.5  4.9 62.9 66.6 67.5 67.3 70.8 72.8 69.5 70.5 71.1 66.7 67.4
 68.8  2.9 65.4 73.8 70.2 65.9 73.3  2.1 67.9 69.6  1.9  4.7 71.7 75.3
 72.6  4.8  3.   6.3 67.  70.1 70.3 74.6 68.1  3.4]
Unique values in Data_Value_Alt : [30.6 29.3 28.8 32.7 26.6 27.4 48.5 31.6 38.1 35.2 30.5 40.2 35.4 32.3
 15.3  nan 33.7 32.1 19.8 35.1 36.1 27.9 42.  16.9 34.5 31.5 50.4 32.2
 29.5 21.7 34.2 21.5 29.8 40.1 18.8 13.7 33.  25.9 20.1 36.  29.1 17.5
 38.5 35.3 38.  27.5 44.3 28.3 32.5 35.9 24.2 28.5 25.  56.1 25.1 39.
 37.4 34.  26.1 33.2 46.1 23.6 22.6 26.  10.9 42.7 35.5 29.9  8.8 22.1
 28.1 21.1 35.8 24.  42.8 41.9 33.3 33.1 23.8 38.4 36.6 29.2 27.7 23.1
 13.1 10.6 24.6 20.  31.9 15.  23.2 65.8 39.5 30.4 27.2 26.5 29.7 32.8
 22.7 10.1 38.9 19.6 30.2 36.9 20.2 26.8 52.8 41.8 15.7 15.9 34.9 27.3
 23.  37.  34.1 26.9 36.4 32.  12.5 38.2 41.4 34.3 30.8 25.5 28.9 39.2
 28.4 30.7 20.7  9.2 30.9 19.1 25.4 38.7 30.  18.4  7.  38.6 14.1 14.5
 28.7  8.3 25.6 23.5  9.4 37.2 22.9 32.6 30.3 37.1  8.5 60.5 39.1 39.8
 37.7 25.3  4.5 35.7 37.6 39.4 17.  30.1 16.4 44.6 32.9 12.2 33.4 36.3
 21.9 25.8 55.4 10.8 24.4 20.9 39.6 20.4 33.5 17.1 36.2 38.3 21.6 12.9
 11.5 31.1 14.7 39.7  2.2 36.7 31.2 25.7 23.3 27.6 43.7 24.7 25.2 26.4
 49.5 39.9 51.1 33.8 46.2 33.6 17.8 34.6 41.2 58.9 13.5 16.5 31.  48.1
  9.7 52.7 40.9 48.4 29.6 33.9 22.  22.4 27.  31.4 34.8 27.1 14.6  4.4
 10.4 15.6 26.7 42.9 44.8 40.4 16.7 40.8 28.2 24.8 43.  29.4 17.4 31.7
 41.3 36.8 35.  18.7 14.8 16.3 32.4 41.5 41.1 22.8 44.5 43.9 22.3 20.3
 29.  54.1 12.8 15.8 18.6 11.8 21.2 45.9 34.7 37.8 28.6 28.  36.5 12.4
 15.4 31.8 40.7 42.1 16.   8.1 26.3 21.8 21.4 17.3 16.8 34.4 43.2 24.1
 40.  46.3 11.6 40.6 37.9 31.3 41.7 43.1 19.5 19.3 26.2  9.6 54.2 53.6
 45.7 14.2 27.8 43.4 42.3 35.6 17.2 14.9 24.9 37.3 21.  38.8 20.8 10.2
 12.1 23.9 18.1 41.  16.2 44.4 37.5 54.5 44.  18.9 16.1 13.3 24.3 46.9
 39.3 23.7  9.3 41.6 45.2 24.5 42.4 12.3 19.9 19.7 22.5 13.9 10.7 15.5
 17.7 51.  40.3 18.2 43.6 15.1 18.3 47.9 42.2 50.9  5.9 16.6  9.8  7.3
 45.5 44.1 19.2 49.6 48.  13.2 46.   6.9 51.3 19.  23.4 12.7 18.5  7.8
 15.2 13.  18.  10.3 22.2 11.2 59.4 12.6 59.7 19.4  8.2 56.4 60.4 43.3
 53.9 20.6 40.5 50.5 50.2 45.8 46.6 44.2 45.4 12.  54.4 20.5 47.5  7.7
 43.5 48.9 53.2 42.6 62.  50.6 53.4 42.5 13.8 48.6  9.9 54.8 14.4 11.9
 14.3  6.4  8.6  9.5 51.2 47.7 52.2 50.  21.3  2.6 45.6 13.6 11.7 13.4
 47.6 55.2 58.  10.5 45.3 46.8 11.1 17.6 54.  47.3 52.5 49.4 57.  44.7
 58.6  9.1 17.9  6.1 47.4 14.  48.2 47.   6.5  7.6 11.4 55.6 53.5 46.4
 49.  50.3 11.3  8.  46.7 52.9 51.8  4.6 45.1 47.1 50.7  2.5  8.4 44.9
  4.2 45.   8.7 50.1 47.2 60.8 51.9 51.5  3.6 43.8 53.  11.  50.8  6.7
 47.8 49.8 49.1 10.  48.3 54.7 49.7  9.   6.2  7.5 46.5  5.7 52.6 49.2
 48.7 52.4 55.7 49.9 54.9 56.9 56.  59.1 52.1 58.8 48.8 57.8 55.  59.5
 65.2 58.7 51.6 55.5 54.3 62.6  7.4 63.6 51.4 56.5 53.7 63.4 59.  60.1
 55.1 53.1 70.6 53.8 60.7 58.3 57.1 61.2 58.4 60.  59.6 59.3 56.8 55.9
 71.6 52.3 49.3 52.  61.1 67.1 64.1 64.2  3.9 58.1 66.8 54.6 56.6 67.2
 64.6 62.2 59.2 68.2 55.8 56.2  7.9 57.9  8.9 59.9 58.2 58.5 60.2 60.6
  1.7 69.7 57.7 59.8 63.5 66.4 57.4 61.3 57.5 61.5 56.7  2.3 56.3  7.1
 62.3  5.2 53.3  3.3 61.7  2.8 55.3 63.   6.6 61.9 51.7 65.1 60.9 62.4
 63.1 63.3  4.1 71.4 61.4 73.1 62.5  5.8 62.8 61.  71.5 57.2  4.   2.
 64.5 65.7 63.2  0.9  5.1 70.  64.7 62.1 63.7  5.6 61.6 65.6 61.8 64.3
 57.6 66.9 65.3 66.  57.3 63.9  5.4 64.4 72.3 68.3 69.1  4.3  6.8  3.2
 68.6 69.3 70.7 62.7 65.  68.9  5.   5.5 77.6 64.8 66.2  7.2  2.4 69.
  6.  70.4  5.3 68.  70.9 66.3 60.3  3.8 66.1 67.8 71.2 64.9 64.  63.8
 66.5 65.5  4.9 62.9 66.6 67.5 67.3 70.8 72.8 69.5 70.5 71.1 66.7 67.4
 68.8  2.9 65.4 73.8 70.2 65.9 73.3  2.1 67.9 69.6  1.9  4.7 71.7 75.3
 72.6  4.8  3.   6.3 67.  70.1 70.3 74.6 68.1  3.4]
Unique values in Data_Value_Footnote_Symbol : [nan '~']
Unique values in Data_Value_Footnote : [nan 'Data not available because sample size is insufficient.']
Unique values in Low_Confidence_Limit : [29.4 25.7 28.1 31.9 25.6 18.6 32.3 24.  32.6 30.7 25.8 33.3 33.   6.9
  nan 25.1 21.6 10.5 34.6 31.5 27.6 38.4 16.  33.6 27.8 30.5 46.4 28.
 19.9 33.5 19.  26.9 36.5 17.1  9.7 29.7 24.5 17.9 32.9 16.3 36.  32.2
 34.4 25.4 24.4 40.1 26.6 35.3 22.8 22.  51.7 28.7 23.2 34.7 34.8 31.4
 33.2 23.  30.8 37.2 12.8 19.5 19.1  9.9 27.2 38.9 24.7  6.6 21.4 27.5
 16.8 15.2 20.5 34.5 25.  32.8 41.4 30.9 37.6 29.5 19.8 25.2 26.7 29.9
 25.9 21.7  9.  18.7 32.4 30.2 18.2 11.9 63.9 30.1 34.2 29.  22.4 22.5
 32.  21.   8.5 37.9 17.5 29.6 10.6 28.5 11.4 29.2 26.2 49.  28.4 37.5
 21.3 39.  12.5 18.4  9.4 23.4 31.3 30.4 31.2  7.5 31.7 22.6 29.8 26.4
 23.9 35.4 27.4 34.  26.8 18.5 20.2 27.9 16.1 23.5 14.8  6.2  7.8 11.2
 25.5 19.2 30.3  8.2 29.3 20.  12.7 13.3 47.6 28.2 35.1  6.5 23.8 56.8
 38.6 35.5 35.  24.9  3.5 35.9 16.5 13.9 32.7 36.1 27.7 39.3 27.3  8.9
 20.4 15.9 36.8  9.8 20.6 36.3 28.6 15.5 14.5 31.6 20.7 35.6 27.1 16.7
 32.1  7.  18.  13.4  1.6 26.1 14.4 17.7 26.  18.8 21.8 24.3 48.8 41.8
 10.1 38.5 22.2 58.4 28.9 38.3 23.6 21.1 23.3 10.3 22.9 33.7 28.8 37.
 50.2 11.6 21.5 20.3 13.5 26.5 30.6 10.7  1.4  3.9 23.1 24.6 37.7 14.7
 39.1 37.8  8.3 36.9 31.8 27.  22.1 25.3 10.8 23.7 26.3 10.9 15.3 20.8
 17.6 36.4  6.7 15.1 24.1 15.4 19.6 38.  33.9 40.6 24.2 11.1 50.6  4.9
 33.8 36.2 17.3 11.3 12.1  7.6 33.4 32.5 34.3 40.  19.3 18.9 20.1 17.8
  8.8 38.8 11.   7.1 34.1 12.3 37.4 42.4 14.  35.2 24.8 21.2 29.1 43.6
 18.3 41.3 17.4 16.2 13.7  6.1 18.1 51.9 50.3 36.6 11.5 28.3 12.  13.6
 22.3 34.9  9.6 19.7 14.9 11.8 15.8 14.6  7.2 46.9 14.2 33.1 16.9 15.7
 35.7 12.2 41.  40.5 31.1 30.  13.1 10.  22.7 20.9 12.6 16.6 19.4 45.2
  5.6 40.7 39.9 40.3  8.6  9.3  9.5 42.7 36.7 12.9 17.2 46.7  4.6 14.3
 39.6 13.2  6.8  8.1  2.6  8.  31.  39.5 43.3  8.4 42.3 16.4 49.2 15.6
 38.1 12.4 13.8 10.2 35.8  9.1 58.9 53.7 39.8 13.  21.9 17.   6.3 39.4
 50.5 55.9 51.6 44.2 42.1 41.6 15.  38.7 37.1 48.5 44.6 14.1 41.5 42.9
 37.3  3.2 40.9  4.8 41.1 48.  11.7 52.6  6.4 40.4 44.3 52.3  5.8 39.2
  3.3  7.3  3.   4.7 42.5 47.9 48.7 46.  42.8  0.8 41.2  9.2 44.  49.6
 41.7 53.9 43.9 40.8 38.2  7.9  7.7 10.4 50.7 45.6 44.8 43.7 40.2 52.1
 42.2  4.5 45.7 43.   4.3  3.6 48.6 45.5 43.5 44.1  5.7  8.7  4.2 45.
 53.4  5.9 48.2 48.4 51.3  3.4 42.6 43.4 44.5 42.   0.7 53.1  4.  39.7
 54.5  2.5 45.1 49.3 46.3  7.4  5.1  5.   2.4 43.8  5.2  2.9  5.4 45.3
 41.9 52.  52.8 52.9 55.  48.9 47.1 47.2 59.1 56.2 46.1 58.3 54.  45.9
 52.4 53.5  5.3 47.5 54.6 46.2 44.7 58.2 57.3 50.  49.4 49.5 44.9  1.7
 51.   5.5 43.1 61.4 53.  48.3 60.2 54.2 57.2 53.6 68.1 51.2 47.4 44.4
 55.2 46.6 47.7 53.3 49.8 49.9  2.2 55.6 57.6 53.2 55.7 50.4 52.2 51.1
 69.5 49.1 46.8 45.8 47.8 51.4 52.5 43.2 55.3 47.3 49.7 53.8 47.  61.1
 56.5 62.3  1.  62.6 48.1 51.5 64.7 62.1 57.4 66.4 52.7 57.8 54.7 56.
 54.9 58.5  2.3 56.3 50.1  0.9 46.5 66.6  6.  56.6 60.4 56.4 63.3 59.5
 55.4 62.7 59.9  3.7 45.4 58.1 54.1 61.8 61.2 57.  50.8 60.9 55.8  3.8
  1.3 54.8 57.1 57.9 59.  59.7  1.2  0.6 60.8  4.4  0.3 50.9 66.5 55.5
 62.4 57.5 59.6 59.3 51.8 54.3 65.2 64.1 60.6 58.  61.9 63.  62.2 56.1
 62.8 54.4 55.1 56.9 66.7 58.8 60.7 67.2 63.1 64.3  2.7  3.1 64.6 56.7
  2.   1.9 60.1 57.7 58.7  2.8 59.2 58.6 60.5  1.8 62.9 61.7 63.2 63.4
 67.8 60.  61.  62.5 68.  59.8 60.3 64.2 61.5 65.6 64.8 59.4 61.6 63.8
 70.2 66.  67.7 61.3 65.   1.5  4.1 64.9 67.3 64.5 65.1 65.4  0.4 63.6
 66.8 67.  63.5 66.9 69.  62.  67.9 68.2  1.1]
Unique values in High_Confidence_Limit  : [31.8 33.3 29.5 33.5 27.6 38.5 64.9 40.4 43.8 40.  35.6 47.4 37.9 39.8
 30.6  nan 43.4 44.8 34.4 35.7 41.  28.3 45.7 17.8 38.7 41.9 32.5 54.3
 33.  31.1 23.7 32.6 35.  24.3 32.8 20.7 19.2 36.4 27.3 22.4 39.2 30.2
 18.8 41.1 41.7 29.7 28.9 48.6 30.1 42.7 37.5 25.7 31.4 28.4 60.5 27.1
 33.2 34.8 29.6 55.3 26.1 34.2 12.  52.3 46.6 11.6 22.8 28.6 35.5 41.5
 27.8 51.5 28.2 38.  28.8 42.3 46.3 36.9 31.7 28.1 42.5 34.5 32.7 29.8
 24.5 24.1 12.3 31.6 34.  31.  40.6 21.9 36.1 26.8 67.6 30.7 35.8 54.2
 36.2 30.8 31.9 24.4 11.9 21.7 46.2 35.2 27.4 56.6 40.1 44.6 19.4 32.1
 26.6 43.  38.1 34.9 29.2 38.4 20.2 49.8 37.  42.9 37.1 39.6 34.3 32.
 27.2 35.3 52.2 43.2 32.2 35.9 31.2 23.2 11.3 23.  33.8 36.5 22.3 35.1
 42.8 22.6  7.9 41.4 18.5 32.4 10.5 26.7 28.5 10.8 34.7 46.  44.4 16.5
 37.4 45.8 19.3 49.3 39.1 11.1 26.2 64.  44.3 42.2  5.9 36.  43.1 17.5
 30.5 19.1 40.3 43.6 50.  39.  16.6 44.  41.2 37.6 23.5 59.1 47.9 29.
 21.2 26.3 33.9 38.3 20.1 48.  25.8 39.4 28.  44.9 23.1 22.5 18.6 30.
 16.1 41.6 36.6  3.1 33.6 23.3 40.5 39.7 21.3 24.8 45.5 38.8 48.5 50.3
 48.1 28.7 60.3 57.  29.4 17.4 39.9 32.9 59.5 45.3 36.7 37.2 24.7 32.3
 17.7 38.2 42.1 49.7 11.2 45.9 55.1 49.2 38.9 24.6 45.6 36.3 29.3 45.
 42.6 47.5 40.8 33.4 58.2 19.6 12.7 46.8 25.2 54.6 22.  43.3 56.4 62.7
 37.8 26.5 25.9 51.6 30.3 51.1 39.3 37.3 48.2 43.9 54.9 47.8 40.9 17.3
 29.1 43.5 44.1 44.2 44.7 25.5 49.1 35.4 47.2 54.7 41.3 36.8 30.9 38.6
 72.5 23.4 33.1 14.4 54.8 20.  12.9 43.7 14.9 31.5 23.8 51.9 37.7 48.3
 14.3 34.1 68.4  9.1 39.5 29.9 46.9 56.7 50.4 53.1 40.7 46.5 45.4 51.
 57.6 53.3 15.1 21.6 25.6 17.1 26.4 24.  14.8 48.4 56.8 55.  21.  42.
 16.8 50.9 46.1 21.4 46.7 41.8 33.7 51.4 42.4 22.1 11.4 18.7 22.2 50.6
 26.  20.3 62.  52.8 18.4 51.3 15.3 40.2 61.6 58.5 55.4 34.6 31.3 45.1
 11.  27.5 65.9 50.7 12.6 47.1 45.2 13.2 18.1 21.5 20.8 17.6 12.8 54.4
 19.  19.9 20.4 27.9 13.5 23.9 51.7 19.7 22.7 13.7 13.9 53.2 25.1  7.5
 27.7 27.  54.5 20.9 61.  52.5 48.8 22.9 55.9 54.1 21.1 30.4 17.2 21.8
 53.4 53.  14.2 20.5  8.7 48.7 49.  15.8 52.1 56.  11.8 15.6 59.9  9.
 13.  65.4 19.8 18.  10.4 24.9 25.  26.9 15.9 10.3  9.7 14.7 62.2 64.6
 18.2 58.3 59.2 56.2 24.2 47.3 50.2 16.9 51.2 56.1 20.6 16.4 15.2 13.6
 50.1 60.7 25.3 19.5 49.9 16.3 52.4 12.4 23.6 60.2 56.3 18.9 46.4 14.5
 18.3 59.4 25.4 47.  17.  49.4 65.8 58.4 59.7 67.7 62.8 70.6 56.9 13.1
 58.  52.7 76.  49.6 13.3 44.5 47.6 49.5 56.5 63.   9.9 15.5 66.3 55.8
 55.6 69.1 14.1 53.9 15.4 12.5  8.3 14.6 17.9 64.1 47.7 51.8 15.  69.8
 61.4 57.2 53.5 54.  13.4 74.9 61.2 57.4 73.3 63.2 15.7  7.7 60.9 50.5
 53.8 16.  55.7 11.5 14.  62.4 48.9 16.7 60.8 70.1 16.2 65.  55.2 58.9
 57.3 10.6  6.2 52.6 59.6 57.7 53.6  8.2 50.8  8.4 57.1 66.6 53.7 57.8
 12.2 61.9 10.1 66.7 62.9 60.1  5.1 68.7 59.  70.2 52.  10.7 58.6 61.8
 60.6 13.8 55.5 59.3 64.8 10.2 66.4 60.4 67.9  7.8  9.3 61.3 58.8 10.9
 65.7 63.9 52.9 68.9 66.5 67.3 68.6 74.4 62.1 63.1 75.2 60.  58.7 57.5
 63.6 68.1 58.1 59.8 57.9 72.9 12.1  9.8 11.7 62.3 63.4 65.1 64.2 62.5
 66.2 61.1 73.7 68.  80.9 61.5 65.2 63.7 67.  65.3 66.1 76.2 70.7 69.5
 74.2 69.9 70.5  6.1 75.  65.5  9.2 72.3 64.7 72.2 61.7 66.8  3.4 72.6
 63.8 10.  70.3 63.5  5.4 69.4  3.7 71.2 65.6  7.6 62.6 64.5  7.3  4.4
 73.1  4.7 66.9 73.2 71.8 67.4 68.3 66.   7.4 63.3 72.   6.4  9.5  4.6
 84.3 72.4 69.3 75.8 71.9 75.5 72.1 73.6 84.8  6.9 73.8 64.4  3.  71.5
 76.1  9.4 67.1 67.2 67.8 69.2 75.6  8.8 64.3  8.6 74.5 83.2 74.8 71.3
 68.8 69.  72.8 73.9 84.  67.5 76.6 83.4 78.6 70.  86.8  8.1 68.5 70.9
  6.7 69.7 74.7 77.8  6.3 71.1 80.7 84.2  6.6 79.1 71.  77.4 71.7  8.5
 76.3 83.  72.7  5.3  5.8 70.4  7.  79.8 79.2 74.6 77.  74.1  5.7 68.2
 76.9 78.9 75.9 79.9  6.8 73.4 71.6 79.6 81.6  8.9 69.6 74.3  5.5  9.6
 80.1 82.4 83.9 77.1  7.2 76.4 85.  80.  71.4 75.1  5.  80.6 81.2 70.8
 79.4 78.7 77.5 76.8 86.6 75.4 87.7 81.1 78.5 73.5 84.6 78.3 82.9 74.
 78.1 75.7  8.  75.3 77.9 85.2]
Unique values in Sample_Size : [31255.   842. 62562. ...  9468.  4640.  3650.]
Unique values in Total : [nan 'Total']
Unique values in Age(years) : [nan '25 - 34' '55 - 64' '18 - 24' '45 - 54' '35 - 44' '65 or older']
Unique values in Education : [nan 'High school graduate' 'Less than high school'
 'Some college or technical school' 'College graduate']
Unique values in Gender : [nan 'Female' 'Male']
Unique values in Income : [nan '$50,000 - $74,999' 'Data not reported' 'Less than $15,000'
 '$25,000 - $34,999' '$15,000 - $24,999' '$35,000 - $49,999'
 '$75,000 or greater']
Unique values in Race/Ethnicity : ['Hispanic' nan 'American Indian/Alaska Native' 'Asian'
 'Non-Hispanic White' 'Other' '2 or more races'
 'Hawaiian/Pacific Islander' 'Non-Hispanic Black']
Unique values in GeoLocation : [nan '(13.444304, 144.793731)' '(43.235541343, -108.109830353)'
 '(38.890371385, -77.031961127)' '(18.220833, -66.590149)'
 '(32.840571122, -86.631860762)' '(41.708280193, -71.522470314)'
 '(39.008830667000495, -75.57774116799965)'
 '(40.130570048, -74.273691288)' '(47.522278629, -120.47001079)'
 '(45.254228894000505, -68.98503133599962)'
 '(44.661319543001, -84.71439027)' '(37.542680674, -78.457890463)'
 '(37.638640123, -120.999999538)' '(39.360700171, -111.587130635)'
 '(42.827001032, -75.543970427)' '(42.27687047, -72.082690675)'
 '(39.008830667, -75.577741168)' '(34.748650124, -92.274490743)'
 '(40.485010283, -88.997710178)' '(43.655950113, -71.50036092)'
 '(34.520880952, -106.240580985)'
 '(39.29058096400047, -76.60926011099963)'
 '(46.35564873600049, -94.79420050299967)'
 '(21.304850435, -157.857749403)' '(31.312660644, -92.445680071)'
 '(44.353130053000484, -100.3735306369997)'
 '(31.827240407, -99.426770206)' '(37.645970271, -84.774971048)'
 '(44.353130053, -100.373530637)'
 '(43.23554134300048, -108.10983035299967)'
 '(38.665510202, -80.712640135)' '(38.843840757, -106.133610921)'
 '(35.472031356, -97.521070214)' '(32.745510099, -89.538030825)'
 '(44.567449424, -120.155031326)' '(44.393191174, -89.816370742)'
 '(38.3477403, -98.200781227)' '(28.932040377, -81.928960539)'
 '(43.682630005, -114.363730042)' '(45.254228894001, -68.985031336)'
 '(34.86597028, -111.763811277)' '(18.335765, -64.896335)'
 '(32.84057112200048, -86.63186076199969)'
 '(47.066528972, -109.424420645)'
 '(38.89037138500049, -77.03196112699965)' '(46.355648736, -94.794200503)'
 '(32.839681093, -83.627580346)' '(35.466220975, -79.159250463)'
 '(40.793730152, -77.860700294)' '(47.475319779, -100.118421049)'
 '(33.998821303, -81.045371207)' '(41.641040988001, -99.365720623)'
 '(35.68094058, -85.774490914)' '(38.635790776, -92.566300053)'
 '(39.49324039, -117.071840564)'
 '(31.827240407000488, -99.42677020599967)'
 '(39.360700171000474, -111.58713063499971)'
 '(39.290580964, -76.609260111)' '(42.469400913, -93.816490556)'
 '(39.766910452, -86.149960194)' '(32.83968109300048, -83.62758034599966)'
 '(42.46940091300047, -93.81649055599968)'
 '(42.27687047000046, -72.08269067499964)' '(40.060210141, -82.404260056)'
 '(64.845079957001, -147.722059036)' '(43.625381239, -72.517640791)'
 '(41.56266102, -72.649840952)' '(40.13057004800049, -74.27369128799967)'
 '(44.6613195430005, -84.71439026999968)'
 '(43.62538123900049, -72.51764079099962)'
 '(34.865970280000454, -111.76381127699972)'
 '(40.06021014100048, -82.40426005599966)'
 '(40.79373015200048, -77.86070029399963)'
 '(39.493240390000494, -117.07184056399967)'
 '(33.998821303000454, -81.04537120699968)'
 '(37.54268067400045, -78.45789046299967)'
 '(34.52088095200048, -106.24058098499967)'
 '(47.06652897200047, -109.42442064499971)'
 '(44.39319117400049, -89.81637074199966)'
 '(37.63864012300047, -120.99999953799971)'
 '(31.31266064400046, -92.44568007099969)'
 '(44.56744942400047, -120.15503132599969)'
 '(35.47203135600046, -97.52107021399968)'
 '(47.47531977900047, -100.11842104899966)'
 '(35.68094058000048, -85.77449091399967)'
 '(47.52227862900048, -120.47001078999972)'
 '(38.66551020200046, -80.71264013499967)'
 '(64.84507995700051, -147.72205903599973)'
 '(38.635790776000476, -92.56630005299968)'
 '(35.466220975000454, -79.15925046299964)'
 '(41.6410409880005, -99.36572062299967)'
 '(38.34774030000045, -98.20078122699965)'
 '(37.645970271000465, -84.77497104799966)'
 '(41.56266102000046, -72.64984095199964)'
 '(39.766910452000445, -86.14996019399968)'
 '(32.745510099000455, -89.53803082499968)'
 '(41.70828019300046, -71.52247031399963)'
 '(28.932040377000476, -81.92896053899966)'
 '(38.843840757000464, -106.13361092099967)'
 '(43.682630005000476, -114.3637300419997)'
 '(40.48501028300046, -88.99771017799969)'
 '(42.82700103200045, -75.54397042699964)'
 '(21.304850435000446, -157.85774940299973)'
 '(34.74865012400045, -92.27449074299966)'
 '(43.65595011300047, -71.50036091999965)']
Unique values in ClassID : ['PA' 'OWS' 'FV']
Unique values in TopicID : ['PA1' 'OWS1' 'FV1']
Unique values in QuestionID : ['Q047' 'Q036' 'Q037' 'Q045' 'Q044' 'Q043' 'Q046' 'Q018' 'Q019']
Unique values in DataValueTypeID : ['VALUE']
Unique values in LocationID : [59 66 56 11 72  1 44 10 34 53 23 26 51  6 49 36 25  5 17 33 35 24 27 15
 22 46 48 21 54  8 40 28 41 55 20 12 16  4 78 30 13 37 42 38 45 31 47 29
 32 19 18 39  2 50  9]
Unique values in StratificationCategory1 : ['Race/Ethnicity' 'Education' 'Income' 'Age (years)' 'Gender' 'Total' nan]
Unique values in Stratification1 : ['Hispanic' 'High school graduate' '$50,000 - $74,999' 'Data not reported'
 'Less than $15,000' 'American Indian/Alaska Native'
 'Less than high school' '$25,000 - $34,999' '25 - 34' 'Asian'
 'Non-Hispanic White' 'Other' '55 - 64' 'Some college or technical school'
 'Female' '18 - 24' '45 - 54' '$15,000 - $24,999' 'Total' 'Male' '35 - 44'
 '2 or more races' 'Hawaiian/Pacific Islander' 'College graduate'
 'Non-Hispanic Black' '$35,000 - $49,999' '$75,000 or greater'
 '65 or older' nan]
Unique values in StratificationCategoryId1 : ['RACE' 'EDU' 'INC' 'AGEYR' 'GEN' 'OVR' nan]
Unique values in StratificationID1 : ['RACEHIS' 'EDUHSGRAD' 'INC5075' 'INCNR' 'INCLESS15' 'RACENAA' 'EDUHS'
 'INC2535' 'AGEYR2534' 'RACEASN' 'RACEWHT' 'RACEOTH' 'AGEYR5564'
 'EDUCOTEC' 'FEMALE' 'AGEYR1824' 'AGEYR4554' 'INC1525' 'OVERALL' 'MALE'
 'AGEYR3544' 'RACE2PLUS' 'RACEHPI' 'EDUCOGRAD' 'RACEBLK' 'INC3550'
 'INC75PLUS' 'AGEYR65PLUS' nan]

Step 2

There is a lot of data present in this table, but not all of it is necessary. To begin with, we will cut down the table's size by only including survery responses to questions that are relevant to our goal of identifying health risks. We just want to include the questions finding obesity and overweight rates, as well as inactive adults.

InΒ [6]:
#the questions we care about
nut_list = ['Percent of adults who engage in no leisure-time physical activity',
 'Percent of adults aged 18 years and older who have obesity',
 'Percent of adults aged 18 years and older who have an overweight classification']

#we created a new filtered dataframe from the original bloated set
cleaned_ntr = nut_phys_etc[nut_phys_etc['Question'].isin(nut_list)]
print(cleaned_ntr)
       YearStart  YearEnd LocationAbbr LocationDesc  \
0           2020     2020           US     National   
1           2014     2014           GU         Guam   
2           2013     2013           US     National   
3           2013     2013           US     National   
6           2012     2012           WY      Wyoming   
...          ...      ...          ...          ...   
93244       2022     2022           WY      Wyoming   
93245       2022     2022           WY      Wyoming   
93246       2022     2022           WY      Wyoming   
93247       2022     2022           WY      Wyoming   
93248       2022     2022           WY      Wyoming   

                                       Datasource                    Class  \
0      Behavioral Risk Factor Surveillance System        Physical Activity   
1      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
2      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
3      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
6      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
...                                           ...                      ...   
93244                                       BRFSS  Obesity / Weight Status   
93245                                       BRFSS        Physical Activity   
93246                                       BRFSS  Obesity / Weight Status   
93247                                       BRFSS  Obesity / Weight Status   
93248                                       BRFSS  Obesity / Weight Status   

                              Topic  \
0      Physical Activity - Behavior   
1           Obesity / Weight Status   
2           Obesity / Weight Status   
3           Obesity / Weight Status   
6           Obesity / Weight Status   
...                             ...   
93244       Obesity / Weight Status   
93245  Physical Activity - Behavior   
93246       Obesity / Weight Status   
93247       Obesity / Weight Status   
93248       Obesity / Weight Status   

                                                Question  Data_Value_Unit  \
0      Percent of adults who engage in no leisure-tim...              NaN   
1      Percent of adults aged 18 years and older who ...              NaN   
2      Percent of adults aged 18 years and older who ...              NaN   
3      Percent of adults aged 18 years and older who ...              NaN   
6      Percent of adults aged 18 years and older who ...              NaN   
...                                                  ...              ...   
93244  Percent of adults aged 18 years and older who ...              NaN   
93245  Percent of adults who engage in no leisure-tim...              NaN   
93246  Percent of adults aged 18 years and older who ...              NaN   
93247  Percent of adults aged 18 years and older who ...              NaN   
93248  Percent of adults aged 18 years and older who ...              NaN   

      Data_Value_Type  ...                               GeoLocation  ClassID  \
0               Value  ...                                       NaN       PA   
1               Value  ...                   (13.444304, 144.793731)      OWS   
2               Value  ...                                       NaN      OWS   
3               Value  ...                                       NaN      OWS   
6               Value  ...            (43.235541343, -108.109830353)      OWS   
...               ...  ...                                       ...      ...   
93244           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93245           Value  ...  (43.23554134300048, -108.10983035299967)       PA   
93246           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93247           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93248           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   

      TopicID QuestionID  DataValueTypeID  LocationID  \
0         PA1       Q047            VALUE          59   
1        OWS1       Q036            VALUE          66   
2        OWS1       Q036            VALUE          59   
3        OWS1       Q037            VALUE          59   
6        OWS1       Q037            VALUE          56   
...       ...        ...              ...         ...   
93244    OWS1       Q037            VALUE          56   
93245     PA1       Q047            VALUE          56   
93246    OWS1       Q036            VALUE          56   
93247    OWS1       Q037            VALUE          56   
93248    OWS1       Q036            VALUE          56   

       StratificationCategory1                Stratification1  \
0               Race/Ethnicity                       Hispanic   
1                    Education           High school graduate   
2                       Income              $50,000 - $74,999   
3                       Income              Data not reported   
6               Race/Ethnicity  American Indian/Alaska Native   
...                        ...                            ...   
93244                   Income              Less than $15,000   
93245                Education          Less than high school   
93246              Age (years)                        35 - 44   
93247                   Income              $35,000 - $49,999   
93248                Education          Less than high school   

      StratificationCategoryId1 StratificationID1  
0                          RACE           RACEHIS  
1                           EDU         EDUHSGRAD  
2                           INC           INC5075  
3                           INC             INCNR  
6                          RACE           RACENAA  
...                         ...               ...  
93244                       INC         INCLESS15  
93245                       EDU             EDUHS  
93246                     AGEYR         AGEYR3544  
93247                       INC           INC3550  
93248                       EDU             EDUHS  

[54323 rows x 33 columns]

Step 3

After giving a closer look at the table, we see that some columns have missing values, represented as NaN. This will be an issue in our analysis as we cannot work with missing data. We need to find what causes this data to be missing, and how we can either filter it out, or replace it.

InΒ [6]:
missing_values = cleaned_ntr.isna().sum()
print(missing_values)
YearStart                         0
YearEnd                           0
LocationAbbr                      0
LocationDesc                      0
Datasource                        0
Class                             0
Topic                             0
Question                          0
Data_Value_Unit               54323
Data_Value_Type                   0
Data_Value                     5264
Data_Value_Alt                 5264
Data_Value_Footnote_Symbol    49059
Data_Value_Footnote           49059
Low_Confidence_Limit           5264
High_Confidence_Limit          5264
Sample_Size                    5264
Total                         52383
Age(years)                    42683
Education                     46563
Gender                        50443
Income                        40743
Race/Ethnicity                38803
GeoLocation                    1008
ClassID                           0
TopicID                           0
QuestionID                        0
DataValueTypeID                   0
LocationID                        0
StratificationCategory1           3
Stratification1                   3
StratificationCategoryId1         3
StratificationID1                 3
dtype: int64

Step 4

We see that our most important column, Data_Value, has mising values. These are the actual percent values we need for analysis. If these are missing, then we cannot do anything with those entries in the table. Our first step will be to remove the instances of this insufficient data, and continue pruning the missing entries.

InΒ [8]:
remove_insuff_ntr = cleaned_ntr[cleaned_ntr['Data_Value'].notna()]
missing_values = remove_insuff_ntr.isna().sum()
print(missing_values)
print(remove_insuff_ntr)
YearStart                         0
YearEnd                           0
LocationAbbr                      0
LocationDesc                      0
Datasource                        0
Class                             0
Topic                             0
Question                          0
Data_Value_Unit               49059
Data_Value_Type                   0
Data_Value                        0
Data_Value_Alt                    0
Data_Value_Footnote_Symbol    49059
Data_Value_Footnote           49059
Low_Confidence_Limit              0
High_Confidence_Limit             0
Sample_Size                       0
Total                         47128
Age(years)                    37473
Education                     41335
Gender                        45197
Income                        35545
Race/Ethnicity                38617
GeoLocation                    1005
ClassID                           0
TopicID                           0
QuestionID                        0
DataValueTypeID                   0
LocationID                        0
StratificationCategory1           0
Stratification1                   0
StratificationCategoryId1         0
StratificationID1                 0
dtype: int64
       YearStart  YearEnd LocationAbbr LocationDesc  \
0           2020     2020           US     National   
1           2014     2014           GU         Guam   
2           2013     2013           US     National   
3           2013     2013           US     National   
6           2012     2012           WY      Wyoming   
...          ...      ...          ...          ...   
93244       2022     2022           WY      Wyoming   
93245       2022     2022           WY      Wyoming   
93246       2022     2022           WY      Wyoming   
93247       2022     2022           WY      Wyoming   
93248       2022     2022           WY      Wyoming   

                                       Datasource                    Class  \
0      Behavioral Risk Factor Surveillance System        Physical Activity   
1      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
2      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
3      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
6      Behavioral Risk Factor Surveillance System  Obesity / Weight Status   
...                                           ...                      ...   
93244                                       BRFSS  Obesity / Weight Status   
93245                                       BRFSS        Physical Activity   
93246                                       BRFSS  Obesity / Weight Status   
93247                                       BRFSS  Obesity / Weight Status   
93248                                       BRFSS  Obesity / Weight Status   

                              Topic  \
0      Physical Activity - Behavior   
1           Obesity / Weight Status   
2           Obesity / Weight Status   
3           Obesity / Weight Status   
6           Obesity / Weight Status   
...                             ...   
93244       Obesity / Weight Status   
93245  Physical Activity - Behavior   
93246       Obesity / Weight Status   
93247       Obesity / Weight Status   
93248       Obesity / Weight Status   

                                                Question  Data_Value_Unit  \
0      Percent of adults who engage in no leisure-tim...              NaN   
1      Percent of adults aged 18 years and older who ...              NaN   
2      Percent of adults aged 18 years and older who ...              NaN   
3      Percent of adults aged 18 years and older who ...              NaN   
6      Percent of adults aged 18 years and older who ...              NaN   
...                                                  ...              ...   
93244  Percent of adults aged 18 years and older who ...              NaN   
93245  Percent of adults who engage in no leisure-tim...              NaN   
93246  Percent of adults aged 18 years and older who ...              NaN   
93247  Percent of adults aged 18 years and older who ...              NaN   
93248  Percent of adults aged 18 years and older who ...              NaN   

      Data_Value_Type  ...                               GeoLocation  ClassID  \
0               Value  ...                                       NaN       PA   
1               Value  ...                   (13.444304, 144.793731)      OWS   
2               Value  ...                                       NaN      OWS   
3               Value  ...                                       NaN      OWS   
6               Value  ...            (43.235541343, -108.109830353)      OWS   
...               ...  ...                                       ...      ...   
93244           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93245           Value  ...  (43.23554134300048, -108.10983035299967)       PA   
93246           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93247           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   
93248           Value  ...  (43.23554134300048, -108.10983035299967)      OWS   

      TopicID QuestionID  DataValueTypeID  LocationID  \
0         PA1       Q047            VALUE          59   
1        OWS1       Q036            VALUE          66   
2        OWS1       Q036            VALUE          59   
3        OWS1       Q037            VALUE          59   
6        OWS1       Q037            VALUE          56   
...       ...        ...              ...         ...   
93244    OWS1       Q037            VALUE          56   
93245     PA1       Q047            VALUE          56   
93246    OWS1       Q036            VALUE          56   
93247    OWS1       Q037            VALUE          56   
93248    OWS1       Q036            VALUE          56   

       StratificationCategory1                Stratification1  \
0               Race/Ethnicity                       Hispanic   
1                    Education           High school graduate   
2                       Income              $50,000 - $74,999   
3                       Income              Data not reported   
6               Race/Ethnicity  American Indian/Alaska Native   
...                        ...                            ...   
93244                   Income              Less than $15,000   
93245                Education          Less than high school   
93246              Age (years)                        35 - 44   
93247                   Income              $35,000 - $49,999   
93248                Education          Less than high school   

      StratificationCategoryId1 StratificationID1  
0                          RACE           RACEHIS  
1                           EDU         EDUHSGRAD  
2                           INC           INC5075  
3                           INC             INCNR  
6                          RACE           RACENAA  
...                         ...               ...  
93244                       INC         INCLESS15  
93245                       EDU             EDUHS  
93246                     AGEYR         AGEYR3544  
93247                       INC           INC3550  
93248                       EDU             EDUHS  

[49059 rows x 33 columns]

Step 5

We see that three columns, Data_Value_Unit, Data_Value_Footnote_Symbol, and Data_Value_Footnote all have as many missing values as rows present in the table. This means every entry for these three is NaN, and they are a waste of columns. Our next step will be to remove them, and then to see remaining missing values, and the new-look dataframe.

InΒ [12]:
drop_empty_ntr = remove_insuff_ntr.drop(['Data_Value_Unit', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote'], axis=1)
missing_values = drop_empty_ntr.isna().sum()
print(missing_values)
drop_empty_ntr.head()
YearStart                        0
YearEnd                          0
LocationAbbr                     0
LocationDesc                     0
Datasource                       0
Class                            0
Topic                            0
Question                         0
Data_Value_Type                  0
Data_Value                       0
Data_Value_Alt                   0
Low_Confidence_Limit             0
High_Confidence_Limit            0
Sample_Size                      0
Total                        47128
Age(years)                   37473
Education                    41335
Gender                       45197
Income                       35545
Race/Ethnicity               38617
GeoLocation                   1005
ClassID                          0
TopicID                          0
QuestionID                       0
DataValueTypeID                  0
LocationID                       0
StratificationCategory1          0
Stratification1                  0
StratificationCategoryId1        0
StratificationID1                0
dtype: int64
Out[12]:
YearStart YearEnd LocationAbbr LocationDesc Datasource Class Topic Question Data_Value_Type Data_Value ... GeoLocation ClassID TopicID QuestionID DataValueTypeID LocationID StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 2020 US National Behavioral Risk Factor Surveillance System Physical Activity Physical Activity - Behavior Percent of adults who engage in no leisure-tim... Value 30.6 ... NaN PA PA1 Q047 VALUE 59 Race/Ethnicity Hispanic RACE RACEHIS
1 2014 2014 GU Guam Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... Value 29.3 ... (13.444304, 144.793731) OWS OWS1 Q036 VALUE 66 Education High school graduate EDU EDUHSGRAD
2 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... Value 28.8 ... NaN OWS OWS1 Q036 VALUE 59 Income $50,000 - $74,999 INC INC5075
3 2013 2013 US National Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... Value 32.7 ... NaN OWS OWS1 Q037 VALUE 59 Income Data not reported INC INCNR
6 2012 2012 WY Wyoming Behavioral Risk Factor Surveillance System Obesity / Weight Status Obesity / Weight Status Percent of adults aged 18 years and older who ... Value 48.5 ... (43.235541343, -108.109830353) OWS OWS1 Q037 VALUE 56 Race/Ethnicity American Indian/Alaska Native RACE RACENAA

5 rows Γ— 30 columns

Step 6

Looking back at our long list of unique values, we will notice other unnecessary columns. We do not need more than one column for Location, Class, Topic, or Question, so we will drop their ID columns. We also do not need a datasource column, it is only one source. Clas will not be necessary for our analysis either, as the questions will be enough to separate our analysis. Geolocation does not seem to be relevant, we will not be looking at coordinate data, so we will drop that as well.

We will also check to see if we need both YearStart and YearEnd. If there are 0 instances of the survery going beyond one calendar year, we will only need one year column for our data.

InΒ [13]:
drop_red_ntr = drop_empty_ntr.drop(['LocationAbbr', 'Datasource', 'Class', 'Data_Value_Type', 'ClassID', 'TopicID', 'QuestionID', 'DataValueTypeID', 'LocationID', 'GeoLocation'], axis=1)
check_years = (drop_red_ntr['YearStart'] != drop_red_ntr['YearEnd']).sum()
print(check_years)
0
InΒ [14]:
drop_red_ntr = drop_red_ntr.drop('YearEnd', axis=1)
drop_red_ntr = drop_red_ntr.rename(columns={'YearStart': 'Year'})
drop_red_ntr.head()
Out[14]:
Year LocationDesc Topic Question Data_Value Data_Value_Alt Low_Confidence_Limit High_Confidence_Limit Sample_Size Total Age(years) Education Gender Income Race/Ethnicity StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 National Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 30.6 30.6 29.4 31.8 31255.0 NaN NaN NaN NaN NaN Hispanic Race/Ethnicity Hispanic RACE RACEHIS
1 2014 Guam Obesity / Weight Status Percent of adults aged 18 years and older who ... 29.3 29.3 25.7 33.3 842.0 NaN NaN High school graduate NaN NaN NaN Education High school graduate EDU EDUHSGRAD
2 2013 National Obesity / Weight Status Percent of adults aged 18 years and older who ... 28.8 28.8 28.1 29.5 62562.0 NaN NaN NaN NaN $50,000 - $74,999 NaN Income $50,000 - $74,999 INC INC5075
3 2013 National Obesity / Weight Status Percent of adults aged 18 years and older who ... 32.7 32.7 31.9 33.5 60069.0 NaN NaN NaN NaN Data not reported NaN Income Data not reported INC INCNR
6 2012 Wyoming Obesity / Weight Status Percent of adults aged 18 years and older who ... 48.5 48.5 32.3 64.9 69.0 NaN NaN NaN NaN NaN American Indian/Alaska Native Race/Ethnicity American Indian/Alaska Native RACE RACENAA

Step 7

We now have pruned out most of the unnecessary data from our dataframe, and we will look to see if there are other forms of missing data present. Looking at the stratification categories, which are areas where the survery was separated by demographic group, we will see that there is one result that has incomplete data. Some respones for income did not report that income, which is represented with StratificationID1 of INCNR, so we will drop that as well.

We will do one final check on missing data and see that we now have 0 missing data that could cause problems. The reason we are okay with the values for the stratification categories, is that each question was asked independently, so a survery response involving education would not count for age, gender, income, or race.

InΒ [16]:
print(drop_red_ntr['StratificationCategory1'].unique())
print(drop_red_ntr['StratificationID1'].unique())
print(drop_red_ntr['Stratification1'].unique())
['Race/Ethnicity' 'Education' 'Income' 'Age (years)' 'Gender' 'Total']
['RACEHIS' 'EDUHSGRAD' 'INC5075' 'RACENAA' 'EDUHS' 'INC2535' 'AGEYR2534'
 'INCLESS15' 'RACEASN' 'AGEYR5564' 'RACEOTH' 'EDUCOTEC' 'FEMALE'
 'AGEYR1824' 'AGEYR4554' 'INC1525' 'OVERALL' 'MALE' 'AGEYR3544'
 'RACE2PLUS' 'EDUCOGRAD' 'RACEWHT' 'INC3550' 'INC75PLUS' 'AGEYR65PLUS'
 'RACEBLK' 'RACEHPI']
['Hispanic' 'High school graduate' '$50,000 - $74,999'
 'American Indian/Alaska Native' 'Less than high school'
 '$25,000 - $34,999' '25 - 34' 'Less than $15,000' 'Asian' '55 - 64'
 'Other' 'Some college or technical school' 'Female' '18 - 24' '45 - 54'
 '$15,000 - $24,999' 'Total' 'Male' '35 - 44' '2 or more races'
 'College graduate' 'Non-Hispanic White' '$35,000 - $49,999'
 '$75,000 or greater' '65 or older' 'Non-Hispanic Black'
 'Hawaiian/Pacific Islander']
InΒ [17]:
drop_red_ntr = drop_red_ntr[~drop_red_ntr['StratificationID1'].isin(['INCNR'])]
print(drop_red_ntr.isna().sum())
drop_red_ntr.head()
Year                             0
LocationDesc                     0
Topic                            0
Question                         0
Data_Value                       0
Data_Value_Alt                   0
Low_Confidence_Limit             0
High_Confidence_Limit            0
Sample_Size                      0
Total                        45197
Age(years)                   35542
Education                    39404
Gender                       43266
Income                       35545
Race/Ethnicity               36686
StratificationCategory1          0
Stratification1                  0
StratificationCategoryId1        0
StratificationID1                0
dtype: int64
Out[17]:
Year LocationDesc Topic Question Data_Value Data_Value_Alt Low_Confidence_Limit High_Confidence_Limit Sample_Size Total Age(years) Education Gender Income Race/Ethnicity StratificationCategory1 Stratification1 StratificationCategoryId1 StratificationID1
0 2020 National Physical Activity - Behavior Percent of adults who engage in no leisure-tim... 30.6 30.6 29.4 31.8 31255.0 NaN NaN NaN NaN NaN Hispanic Race/Ethnicity Hispanic RACE RACEHIS
1 2014 Guam Obesity / Weight Status Percent of adults aged 18 years and older who ... 29.3 29.3 25.7 33.3 842.0 NaN NaN High school graduate NaN NaN NaN Education High school graduate EDU EDUHSGRAD
2 2013 National Obesity / Weight Status Percent of adults aged 18 years and older who ... 28.8 28.8 28.1 29.5 62562.0 NaN NaN NaN NaN $50,000 - $74,999 NaN Income $50,000 - $74,999 INC INC5075
6 2012 Wyoming Obesity / Weight Status Percent of adults aged 18 years and older who ... 48.5 48.5 32.3 64.9 69.0 NaN NaN NaN NaN NaN American Indian/Alaska Native Race/Ethnicity American Indian/Alaska Native RACE RACENAA
7 2012 District of Columbia Obesity / Weight Status Percent of adults aged 18 years and older who ... 31.6 31.6 24.0 40.4 243.0 NaN NaN Less than high school NaN NaN NaN Education Less than high school EDU EDUHS

Step 8

One last thing we will want to do is to make our dataframe easier to manage. Our CPI data is counted at the national level, so state data may vary greatly from that. To solve that, we will denote regions for states, and disregard U.S. territories from the data. We will follow regions as follows: regions

Our next step will be to aggregate our data. We want to create a dataframe that will look for each year, for each question, separated by their stratification value and region. We want to get the average percent values for each of these combinations and we will use the groupby function to achieve this. We will also notice we have no more missing values in this aggregated dataframe.

InΒ [18]:
#our regions
west = ['Washington', 'Oregon', 'California', 'Idaho', 'Nevada', 'Utah', 'Montana', 'Wyoming', 'Utah', 'Colorado', 'Alaska', 'Hawaii']
midwest = ['North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri', 'Michigan', 'Illinois', 'Michigan', 'Indiana', 'Ohio']
northeast = ['Pennsylvania', 'Maryland', 'Delaware', 'New Jersey', 'New York', 'New Hampshire', 'Vermont', 'Connecticut', 'Rhode Island', 'Maine', 'Massachusetts']
southeast = ['District of Columbia', 'Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'West Virginia', 'Kentucky', 'Tennessee', 'Arkansas', 'Louisiana', 'Mississippi', 'Alabama']
southwest = ['Oklahoma', 'Texas', 'New Mexico', 'Arizona']

regions = {'National': 'National'}
for state in west:
    regions[state] = 'West'
for state in midwest:
    regions[state] = 'Midwest'
for state in northeast:
    regions[state] = 'Northeast'
for state in southeast:
    regions[state] = 'Southeast'
for state in southwest:
    regions[state] = 'Southwest'

#we create a new column that puts the region for each entry
drop_red_ntr['Region'] = drop_red_ntr['LocationDesc'].map(regions)

#we drop U.S. territories (minus D.C)
reg_ntr = drop_red_ntr.dropna(subset=['Region'])

#we use groupby the four variables, and take the mean percent respone, and print out the resultant dataframe and missing values
#we use reset_index to allow for the grouped by columns to be colummns, rather than multiindex values
reg_avgs = reg_ntr.groupby(['Year', 'Question', 'Stratification1', 'Region']).agg({
    'Data_Value': 'mean'
}).reset_index()
print(reg_avgs['Data_Value'].isna().sum())
print(reg_avgs)
0
      Year                                           Question  \
0     2011  Percent of adults aged 18 years and older who ...   
1     2011  Percent of adults aged 18 years and older who ...   
2     2011  Percent of adults aged 18 years and older who ...   
3     2011  Percent of adults aged 18 years and older who ...   
4     2011  Percent of adults aged 18 years and older who ...   
...    ...                                                ...   
5669  2022  Percent of adults who engage in no leisure-tim...   
5670  2022  Percent of adults who engage in no leisure-tim...   
5671  2022  Percent of adults who engage in no leisure-tim...   
5672  2022  Percent of adults who engage in no leisure-tim...   
5673  2022  Percent of adults who engage in no leisure-tim...   

        Stratification1     Region  Data_Value  
0     $15,000 - $24,999    Midwest   34.018182  
1     $15,000 - $24,999   National   34.500000  
2     $15,000 - $24,999  Northeast   33.763636  
3     $15,000 - $24,999  Southeast   32.553846  
4     $15,000 - $24,999  Southwest   34.375000  
...                 ...        ...         ...  
5669              Total   National   23.700000  
5670              Total  Northeast   22.463636  
5671              Total  Southeast   26.007692  
5672              Total  Southwest   25.650000  
5673              Total       West   20.427273  

[5674 rows x 5 columns]

Step 9

We will now get the CPI dataframe to be what we want. Our first step will be to find and handle missing values. We see that there are 25 of these, which is few enough to look at individually. Luckily for us, these missing values are for entries outside of our scope. Our survey time range is from 2011 to 2022, so these missing values from 1974 to 1998 are irrelevant in our analysis.

InΒ [18]:
print(cpihist.isna().sum())
Consumer Price Index item     0
Year                          0
Percent change               25
dtype: int64
InΒ [19]:
missing_cpi = cpihist[cpihist['Percent change'].isna()]
missing_cpi.head(25)
Out[19]:
Consumer Price Index item Year Percent change
833 Processed fruits and vegetables 1974 NaN
834 Processed fruits and vegetables 1975 NaN
835 Processed fruits and vegetables 1976 NaN
836 Processed fruits and vegetables 1977 NaN
837 Processed fruits and vegetables 1978 NaN
838 Processed fruits and vegetables 1979 NaN
839 Processed fruits and vegetables 1980 NaN
840 Processed fruits and vegetables 1981 NaN
841 Processed fruits and vegetables 1982 NaN
842 Processed fruits and vegetables 1983 NaN
843 Processed fruits and vegetables 1984 NaN
844 Processed fruits and vegetables 1985 NaN
845 Processed fruits and vegetables 1986 NaN
846 Processed fruits and vegetables 1987 NaN
847 Processed fruits and vegetables 1988 NaN
848 Processed fruits and vegetables 1989 NaN
849 Processed fruits and vegetables 1990 NaN
850 Processed fruits and vegetables 1991 NaN
851 Processed fruits and vegetables 1992 NaN
852 Processed fruits and vegetables 1993 NaN
853 Processed fruits and vegetables 1994 NaN
854 Processed fruits and vegetables 1995 NaN
855 Processed fruits and vegetables 1996 NaN
856 Processed fruits and vegetables 1997 NaN
857 Processed fruits and vegetables 1998 NaN

Step 10

Similar to before, we want a look at what the data could represent. We first want to make sure the year column is an integer, this will help to remove entries outside of our time range. We see all of the food groups and the range of years.

We will disregard all entries from before 2010. The reason we chose 2010 instead fo 2011 is that we will be using the 2010 values as a baseline for analysis. This will help to see the effects of CPI changes from 2011.

InΒ [19]:
print(cpihist['Year'].dtype)
for column in cpihist.columns:
    print(cpihist[column].unique())
int64
['All food' 'Food away from home' 'Food at home'
 'Meats, poultry, and fish' 'Meats' 'Beef and veal' 'Pork' 'Other meats'
 'Poultry' 'Fish and seafood' 'Eggs' 'Dairy products' 'Fats and oils'
 'Fruits and vegetables' 'Fresh fruits and vegetables' 'Fresh fruits'
 'Fresh vegetables' 'Processed fruits and vegetables' 'Sugar and sweets'
 'Cereals and bakery products' 'Nonalcoholic beverages' 'Other foods']
[1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
 2016 2017 2018 2019 2020 2021 2022]
[ 14.3   8.5   3.    6.3   9.9  11.    8.6   7.8   4.1   2.1   3.8   2.3
   3.2   5.8   2.9   1.2   2.2   2.4   2.8   3.3   2.6   1.8   3.4   4.
   5.5   0.8   3.7   1.4   1.9   0.3   0.9   3.9  12.7   9.4   6.8   7.6
   9.1  11.1   9.    5.4   4.4   4.2   4.6   4.7   2.    1.7   2.5   3.1
   3.6   3.5   1.3   4.5   7.7  14.9   8.2   5.9  10.5  10.8   8.1   7.2
   1.    1.5   4.3   6.5   0.7   6.4   0.5   4.8  -1.3  -0.2   0.4  11.4
  -0.6  16.7  -0.7  -0.4   5.    7.3  -0.8   7.4  -3.5  -0.1   9.6   0.2
  -2.3  18.6  17.    4.9  -1.2  -0.9   7.5  10.1  -1.4   0.6   0.1  -1.9
   5.7   8.4   8.8   9.2  -4.4  -3.2  22.9  27.4  -1.5  -2.1   8.   -0.3
  11.5  -1.   10.2  12.1  -6.3   1.6   9.3   5.3  -0.5  22.4  -5.4  12.9
  -3.3  12.8  -3.   14.7  -4.7   5.2  -1.8   5.6  -2.   -3.9  -4.1   8.7
  17.7  14.2  -5.1  10.4  -1.7  10.6   6.2   5.1  -2.7  14.6  15.3  11.7
   9.7   8.3   1.1  -1.6   7.1   9.5 -16.6  -5.9  26.6 -10.6  -2.4  17.9
  13.8 -13.7  29.2  14.  -14.7  17.8 -21.1  -9.5 -10.   32.2   2.7  11.6
   9.8   6.6  -1.1   7.   -6.4  12.   41.9  10.7 -12.5   7.9   6.7  -2.2
   6.1  18.5  16.5   8.9  13.6  10.9  12.2  10.3  -4.8  19.5  12.4  -4.2
  11.2  13.5  -5.    6.   -6.1  12.6  -4.3  -3.4   0.    nan  52.4  26.1
 -11.3  12.3  23.   29.9  11.3  10.   13.   19.3  15.   19.6  50.6  -2.6
   6.9  21.4]
InΒ [20]:
cpihist = cpihist[cpihist['Year'] >= 2010]
cpihist.head()
Out[20]:
Consumer Price Index item Year Percent change
36 All food 2010 0.8
37 All food 2011 3.7
38 All food 2012 2.6
39 All food 2013 1.4
40 All food 2014 2.4

Step 11

Our next step is to add a price index column to the dataframe. We will set all values in 2010 to be a baseline of 100. For each year and food item, we will simply apply the formula:

formula

This accounts for each previous year's percent change into the current year's index. Because this dataframe is rather short, we will also print out all values for each food item just to take a quick glance over the data, ensuring its correctness.

InΒ [21]:
def create_index(item):
    list = [100]
    #for each item, applies the above formula to the entry for price index
    for r in range(1, len(item)):
        next = list[-1] * (1 + item['Percent change'].iloc[r] / 100)
        list.append(next)
        
    item['Price Index'] = list
    return item

cpi_index = cpihist.groupby('Consumer Price Index item').apply(create_index).reset_index(drop=True)
cpi_index.head()
/var/folders/fw/m7c8y6mj7qq2w3wxy4ynlxph0000gn/T/ipykernel_98603/2485733993.py:10: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  cpi_index = cpihist.groupby('Consumer Price Index item').apply(create_index).reset_index(drop=True)
Out[21]:
Consumer Price Index item Year Percent change Price Index
0 All food 2010 0.8 100.000000
1 All food 2011 3.7 103.700000
2 All food 2012 2.6 106.396200
3 All food 2013 1.4 107.885747
4 All food 2014 2.4 110.475005
InΒ [23]:
for group in cpi_index.groupby('Consumer Price Index item'):
    print(group)
('All food',    Consumer Price Index item  Year  Percent change  Price Index
0                   All food  2010             0.8   100.000000
1                   All food  2011             3.7   103.700000
2                   All food  2012             2.6   106.396200
3                   All food  2013             1.4   107.885747
4                   All food  2014             2.4   110.475005
5                   All food  2015             1.9   112.574030
6                   All food  2016             0.3   112.911752
7                   All food  2017             0.9   113.927958
8                   All food  2018             1.4   115.522949
9                   All food  2019             1.9   117.717885
10                  All food  2020             3.4   121.720293
11                  All food  2021             3.9   126.467385
12                  All food  2022             9.9   138.987656)
('Beef and veal',    Consumer Price Index item  Year  Percent change  Price Index
13             Beef and veal  2010             2.9   100.000000
14             Beef and veal  2011            10.2   110.200000
15             Beef and veal  2012             6.4   117.252800
16             Beef and veal  2013             2.0   119.597856
17             Beef and veal  2014            12.1   134.069197
18             Beef and veal  2015             7.2   143.722179
19             Beef and veal  2016            -6.3   134.667681
20             Beef and veal  2017            -1.2   133.051669
21             Beef and veal  2018             1.4   134.914393
22             Beef and veal  2019             1.6   137.073023
23             Beef and veal  2020             9.6   150.232033
24             Beef and veal  2021             9.3   164.203612
25             Beef and veal  2022             5.3   172.906404)
('Cereals and bakery products',       Consumer Price Index item  Year  Percent change  Price Index
26  Cereals and bakery products  2010            -0.8   100.000000
27  Cereals and bakery products  2011             3.9   103.900000
28  Cereals and bakery products  2012             2.8   106.809200
29  Cereals and bakery products  2013             1.0   107.877292
30  Cereals and bakery products  2014             0.2   108.093047
31  Cereals and bakery products  2015             1.1   109.282070
32  Cereals and bakery products  2016            -0.3   108.954224
33  Cereals and bakery products  2017            -0.5   108.409453
34  Cereals and bakery products  2018             0.4   108.843091
35  Cereals and bakery products  2019             1.4   110.366894
36  Cereals and bakery products  2020             2.2   112.794966
37  Cereals and bakery products  2021             2.3   115.389250
38  Cereals and bakery products  2022            13.0   130.389852)
('Dairy products',    Consumer Price Index item  Year  Percent change  Price Index
39            Dairy products  2010             1.1   100.000000
40            Dairy products  2011             6.8   106.800000
41            Dairy products  2012             2.1   109.042800
42            Dairy products  2013             0.1   109.151843
43            Dairy products  2014             3.6   113.081309
44            Dairy products  2015            -1.3   111.611252
45            Dairy products  2016            -2.3   109.044193
46            Dairy products  2017             0.1   109.153238
47            Dairy products  2018            -0.5   108.607471
48            Dairy products  2019             1.0   109.693546
49            Dairy products  2020             4.4   114.520062
50            Dairy products  2021             1.4   116.123343
51            Dairy products  2022            12.0   130.058144)
('Eggs',    Consumer Price Index item  Year  Percent change  Price Index
52                      Eggs  2010             1.5   100.000000
53                      Eggs  2011             9.2   109.200000
54                      Eggs  2012             3.2   112.694400
55                      Eggs  2013             3.3   116.413315
56                      Eggs  2014             8.4   126.192034
57                      Eggs  2015            17.8   148.654216
58                      Eggs  2016           -21.1   117.288176
59                      Eggs  2017            -9.5   106.145799
60                      Eggs  2018            10.8   117.609546
61                      Eggs  2019           -10.0   105.848591
62                      Eggs  2020             4.3   110.400081
63                      Eggs  2021             4.5   115.368084
64                      Eggs  2022            32.2   152.516607)
('Fats and oils',    Consumer Price Index item  Year  Percent change  Price Index
65             Fats and oils  2010            -0.3   100.000000
66             Fats and oils  2011             9.3   109.300000
67             Fats and oils  2012             6.1   115.967300
68             Fats and oils  2013            -1.4   114.343758
69             Fats and oils  2014             0.1   114.458102
70             Fats and oils  2015            -1.0   113.313521
71             Fats and oils  2016            -0.6   112.633639
72             Fats and oils  2017             0.8   113.534709
73             Fats and oils  2018             0.1   113.648243
74             Fats and oils  2019            -0.7   112.852706
75             Fats and oils  2020             1.3   114.319791
76             Fats and oils  2021             4.6   119.578501
77             Fats and oils  2022            18.5   141.700524)
('Fish and seafood',    Consumer Price Index item  Year  Percent change  Price Index
78          Fish and seafood  2010             1.1   100.000000
79          Fish and seafood  2011             7.1   107.100000
80          Fish and seafood  2012             2.4   109.670400
81          Fish and seafood  2013             2.5   112.412160
82          Fish and seafood  2014             5.8   118.932065
83          Fish and seafood  2015            -0.9   117.861677
84          Fish and seafood  2016            -0.7   117.036645
85          Fish and seafood  2017             1.2   118.441085
86          Fish and seafood  2018             2.1   120.928347
87          Fish and seafood  2019             1.6   122.863201
88          Fish and seafood  2020             3.3   126.917687
89          Fish and seafood  2021             5.4   133.771242
90          Fish and seafood  2022             9.1   145.944425)
('Food at home',     Consumer Price Index item  Year  Percent change  Price Index
91               Food at home  2010             0.3   100.000000
92               Food at home  2011             4.8   104.800000
93               Food at home  2012             2.5   107.420000
94               Food at home  2013             0.9   108.386780
95               Food at home  2014             2.4   110.988063
96               Food at home  2015             1.2   112.319919
97               Food at home  2016            -1.3   110.859761
98               Food at home  2017            -0.2   110.638041
99               Food at home  2018             0.4   111.080593
100              Food at home  2019             0.9   112.080319
101              Food at home  2020             3.5   116.003130
102              Food at home  2021             3.5   120.063239
103              Food at home  2022            11.4   133.750448)
('Food away from home',     Consumer Price Index item  Year  Percent change  Price Index
104       Food away from home  2010             1.3   100.000000
105       Food away from home  2011             2.3   102.300000
106       Food away from home  2012             2.8   105.164400
107       Food away from home  2013             2.1   107.372852
108       Food away from home  2014             2.4   109.949801
109       Food away from home  2015             2.9   113.138345
110       Food away from home  2016             2.6   116.079942
111       Food away from home  2017             2.3   118.749781
112       Food away from home  2018             2.6   121.837275
113       Food away from home  2019             3.1   125.614231
114       Food away from home  2020             3.4   129.885114
115       Food away from home  2021             4.5   135.729945
116       Food away from home  2022             7.7   146.181150)
('Fresh fruits',     Consumer Price Index item  Year  Percent change  Price Index
117              Fresh fruits  2010            -0.6   100.000000
118              Fresh fruits  2011             3.3   103.300000
119              Fresh fruits  2012             1.0   104.333000
120              Fresh fruits  2013             2.0   106.419660
121              Fresh fruits  2014             4.8   111.527804
122              Fresh fruits  2015            -2.2   109.074192
123              Fresh fruits  2016             2.2   111.473824
124              Fresh fruits  2017             0.5   112.031193
125              Fresh fruits  2018             1.0   113.151505
126              Fresh fruits  2019            -1.4   111.567384
127              Fresh fruits  2020            -0.8   110.674845
128              Fresh fruits  2021             5.5   116.761962
129              Fresh fruits  2022             7.9   125.986157)
('Fresh fruits and vegetables',        Consumer Price Index item  Year  Percent change  Price Index
130  Fresh fruits and vegetables  2010             0.7   100.000000
131  Fresh fruits and vegetables  2011             4.5   104.500000
132  Fresh fruits and vegetables  2012            -2.0   102.410000
133  Fresh fruits and vegetables  2013             3.3   105.789530
134  Fresh fruits and vegetables  2014             1.9   107.799531
135  Fresh fruits and vegetables  2015            -0.5   107.260533
136  Fresh fruits and vegetables  2016             1.2   108.547660
137  Fresh fruits and vegetables  2017             0.2   108.764755
138  Fresh fruits and vegetables  2018             1.1   109.961167
139  Fresh fruits and vegetables  2019             1.0   111.060779
140  Fresh fruits and vegetables  2020             0.8   111.949265
141  Fresh fruits and vegetables  2021             3.3   115.643591
142  Fresh fruits and vegetables  2022             7.5   124.316860)
('Fresh vegetables',     Consumer Price Index item  Year  Percent change  Price Index
143          Fresh vegetables  2010             2.0   100.000000
144          Fresh vegetables  2011             5.6   105.600000
145          Fresh vegetables  2012            -5.1   100.214400
146          Fresh vegetables  2013             4.7   104.924477
147          Fresh vegetables  2014            -1.3   103.560459
148          Fresh vegetables  2015             1.6   105.217426
149          Fresh vegetables  2016             0.0   105.217426
150          Fresh vegetables  2017            -0.1   105.112209
151          Fresh vegetables  2018             1.1   106.268443
152          Fresh vegetables  2019             3.8   110.306644
153          Fresh vegetables  2020             2.6   113.174616
154          Fresh vegetables  2021             1.1   114.419537
155          Fresh vegetables  2022             7.0   122.428905)
('Fruits and vegetables',     Consumer Price Index item  Year  Percent change  Price Index
156     Fruits and vegetables  2010             0.2   100.000000
157     Fruits and vegetables  2011             4.1   104.100000
158     Fruits and vegetables  2012            -0.6   103.475400
159     Fruits and vegetables  2013             2.5   106.062285
160     Fruits and vegetables  2014             1.5   107.653219
161     Fruits and vegetables  2015            -0.2   107.437913
162     Fruits and vegetables  2016             0.8   108.297416
163     Fruits and vegetables  2017            -0.2   108.080821
164     Fruits and vegetables  2018             0.7   108.837387
165     Fruits and vegetables  2019             1.0   109.925761
166     Fruits and vegetables  2020             1.4   111.464722
167     Fruits and vegetables  2021             3.2   115.031593
168     Fruits and vegetables  2022             8.5   124.809278)
('Meats',     Consumer Price Index item  Year  Percent change  Price Index
169                     Meats  2010             2.8   100.000000
170                     Meats  2011             8.8   108.800000
171                     Meats  2012             3.4   112.499200
172                     Meats  2013             1.2   113.849190
173                     Meats  2014             9.2   124.323316
174                     Meats  2015             3.0   128.053015
175                     Meats  2016            -4.4   122.418683
176                     Meats  2017            -0.6   121.684171
177                     Meats  2018             0.4   122.170907
178                     Meats  2019             1.3   123.759129
179                     Meats  2020             7.4   132.917305
180                     Meats  2021             7.7   143.151937
181                     Meats  2022             8.2   154.890396)
('Meats, poultry, and fish',     Consumer Price Index item  Year  Percent change  Price Index
182  Meats, poultry, and fish  2010             1.9   100.000000
183  Meats, poultry, and fish  2011             7.4   107.400000
184  Meats, poultry, and fish  2012             3.6   111.266400
185  Meats, poultry, and fish  2013             2.1   113.602994
186  Meats, poultry, and fish  2014             7.2   121.782410
187  Meats, poultry, and fish  2015             1.9   124.096276
188  Meats, poultry, and fish  2016            -3.5   119.752906
189  Meats, poultry, and fish  2017            -0.1   119.633153
190  Meats, poultry, and fish  2018             0.7   120.470585
191  Meats, poultry, and fish  2019             1.0   121.675291
192  Meats, poultry, and fish  2020             6.3   129.340834
193  Meats, poultry, and fish  2021             6.8   138.136011
194  Meats, poultry, and fish  2022             9.6   151.397068)
('Nonalcoholic beverages',     Consumer Price Index item  Year  Percent change  Price Index
195    Nonalcoholic beverages  2010            -0.9   100.000000
196    Nonalcoholic beverages  2011             3.2   103.200000
197    Nonalcoholic beverages  2012             1.1   104.335200
198    Nonalcoholic beverages  2013            -1.0   103.291848
199    Nonalcoholic beverages  2014            -0.5   102.775389
200    Nonalcoholic beverages  2015             1.1   103.905918
201    Nonalcoholic beverages  2016            -0.4   103.490294
202    Nonalcoholic beverages  2017             0.2   103.697275
203    Nonalcoholic beverages  2018             0.0   103.697275
204    Nonalcoholic beverages  2019             1.9   105.667523
205    Nonalcoholic beverages  2020             3.6   109.471554
206    Nonalcoholic beverages  2021             2.8   112.536758
207    Nonalcoholic beverages  2022            11.0   124.915801)
('Other foods',     Consumer Price Index item  Year  Percent change  Price Index
208               Other foods  2010            -0.5   100.000000
209               Other foods  2011             2.3   102.300000
210               Other foods  2012             3.5   105.880500
211               Other foods  2013             0.5   106.409902
212               Other foods  2014             1.0   107.474002
213               Other foods  2015             1.6   109.193586
214               Other foods  2016             0.3   109.521166
215               Other foods  2017             0.1   109.630687
216               Other foods  2018             0.1   109.740318
217               Other foods  2019             0.3   110.069539
218               Other foods  2020             3.1   113.481695
219               Other foods  2021             2.2   115.978292
220               Other foods  2022            12.7   130.707535)
('Other meats',     Consumer Price Index item  Year  Percent change  Price Index
221               Other meats  2010            -0.1   100.000000
222               Other meats  2011             6.4   106.400000
223               Other meats  2012             1.7   108.208800
224               Other meats  2013            -0.1   108.100591
225               Other meats  2014             3.9   112.316514
226               Other meats  2015             4.1   116.921491
227               Other meats  2016            -0.9   115.869198
228               Other meats  2017            -0.9   114.826375
229               Other meats  2018            -0.4   114.367070
230               Other meats  2019             0.9   115.396373
231               Other meats  2020             4.4   120.473814
232               Other meats  2021             2.9   123.967554
233               Other meats  2022            14.2   141.570947)
('Pork',     Consumer Price Index item  Year  Percent change  Price Index
234                      Pork  2010             4.7   100.000000
235                      Pork  2011             8.5   108.500000
236                      Pork  2012             0.3   108.825500
237                      Pork  2013             0.9   109.804929
238                      Pork  2014             9.1   119.797178
239                      Pork  2015            -3.9   115.125088
240                      Pork  2016            -4.1   110.404960
241                      Pork  2017             0.6   111.067389
242                      Pork  2018            -0.4   110.623120
243                      Pork  2019             1.2   111.950597
244                      Pork  2020             6.3   119.003485
245                      Pork  2021             8.6   129.237784
246                      Pork  2022             8.7   140.481472)
('Poultry',     Consumer Price Index item  Year  Percent change  Price Index
247                   Poultry  2010            -0.1   100.000000
248                   Poultry  2011             2.9   102.900000
249                   Poultry  2012             5.5   108.559500
250                   Poultry  2013             4.7   113.661796
251                   Poultry  2014             2.0   115.935032
252                   Poultry  2015             0.4   116.398773
253                   Poultry  2016            -2.7   113.256006
254                   Poultry  2017             0.2   113.482518
255                   Poultry  2018             0.3   113.822965
256                   Poultry  2019            -0.3   113.481496
257                   Poultry  2020             5.6   119.836460
258                   Poultry  2021             5.1   125.948120
259                   Poultry  2022            14.6   144.336545)
('Processed fruits and vegetables',            Consumer Price Index item  Year  Percent change  Price Index
260  Processed fruits and vegetables  2010            -1.3   100.000000
261  Processed fruits and vegetables  2011             2.9   102.900000
262  Processed fruits and vegetables  2012             3.8   106.810200
263  Processed fruits and vegetables  2013             0.3   107.130631
264  Processed fruits and vegetables  2014             0.1   107.237761
265  Processed fruits and vegetables  2015             0.7   107.988426
266  Processed fruits and vegetables  2016            -0.3   107.664460
267  Processed fruits and vegetables  2017            -1.6   105.941829
268  Processed fruits and vegetables  2018            -0.6   105.306178
269  Processed fruits and vegetables  2019             1.1   106.464546
270  Processed fruits and vegetables  2020             3.5   110.190805
271  Processed fruits and vegetables  2021             2.9   113.386338
272  Processed fruits and vegetables  2022            12.0   126.992699)
('Sugar and sweets',     Consumer Price Index item  Year  Percent change  Price Index
273          Sugar and sweets  2010             2.2   100.000000
274          Sugar and sweets  2011             3.3   103.300000
275          Sugar and sweets  2012             3.3   106.708900
276          Sugar and sweets  2013            -1.7   104.894849
277          Sugar and sweets  2014            -0.8   104.055690
278          Sugar and sweets  2015             3.2   107.385472
279          Sugar and sweets  2016            -0.4   106.955930
280          Sugar and sweets  2017            -0.1   106.848974
281          Sugar and sweets  2018             0.4   107.276370
282          Sugar and sweets  2019             2.0   109.421897
283          Sugar and sweets  2020             3.3   113.032820
284          Sugar and sweets  2021             3.0   116.423805
285          Sugar and sweets  2022            10.4   128.531880)

Data VisualizationΒΆ

To better understand our data, we are going to visualize it. We will be primarily using line graphs and scatter plots to try and find visual relationships and trend between data items present in our two dataframe. We will be using matplotlib and seaborn to handle all of our visualizations.

Step 1

Our first graph will be graphing each price index of each food group over time in our CPI data. We will do this to see if food prices have increased. This is beneficial, as we can then use these food trends to see if the price of food has an impact on American adult health.

Looking at the resultant graph, we see that every food group has increased in price since 2010, with some growing much higher in price than others, such as Beef and Veal, Dairy, Meats, and Eggs. It is also clear that since 2020 and COVID, food prices have increased at higher rates than before, which is expected considering the supply issues present from the time.

InΒ [27]:
food_groups = cpi_index['Consumer Price Index item'].unique()

plt.figure(figsize=(15,15))
#we want to plot a line for each individual food group, we simply put the price index for each year
for food in food_groups:
    cpi_food = cpi_index[cpi_index['Consumer Price Index item'] == food]
    plt.plot(cpi_food['Year'], cpi_food['Price Index'], label=food)

plt.title('Indexed Price of CPI Food Groups from 2010 to 2022')
plt.xlabel('Year')
plt.ylabel('Indexed Price')
plt.legend()
plt.show()
No description has been provided for this image

Step 2

We will now want to see how obesity, overwieght, and inactivity rates have changed over time. To start with, we will use the national totals for each response, just to get an overview of the national rates.

The graphs we get are interesting. For the inactivity graph, we see strange patterns in constant spiking and dropping in percents, with a final drop in 2020, most likely due to COVID, which does not recover to higher values as other drops did in the graph.

For the obesity graph we see the expected rise in obesity rates over time. For the overweight graph we see that rates have actually decreased. While this may seem like a good trend at first glance, it is more the fact that overweight Americans are simply being outpaced by obese Americans. More Americans are becoming obese rather than overweight, which is the reason that overweight values are falling while obese rates increase.

InΒ [28]:
#we create a reusable function to allow for easier reproduction of new graphs. we take in which dataframe,
#which question, which demographic to graph and which region to use
def graph_questions(df, question, strat, region):
    plt.figure(figsize=(10,10))
    #we filter the main dataframe given down to only including the group featued in the Question->Strata->Region classification
    sub1 = df[df['Question'] == question]
    sub2 = sub1[sub1['Stratification1'] == strat]
    fixed_df = sub2[sub2['Region'] == region]
    x = fixed_df['Year']
    y = fixed_df['Data_Value']
    plt.plot(x, y)
    plt.title(region + ' ' + question + ' with classifier: ' + strat)
    plt.xlabel('Year')
    plt.ylabel('Percent')
    plt.show()

#graph for each question of interest
for q in nut_list:
    graph_questions(reg_avgs, q, 'Total', 'National')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 3

We will now want to look at how different groups have trended compared to the national average. Unfortunately, there are too many combinations to give a graph for every possible one, so we will focus on groups determined to be at higher risks for these health problems. Looking at social and economic risk factors, indivuduals with lower income values are at higher risks of obesity, as well as Black and Hispanic adult Americans. We will choose to focus only on obesity rates for these groups, as we saw before, overweight classifications may not be accurate for our data.

Upon graphing the risk group obesity rates across the nation, we see that all have them have higher obesity rates over the same period as the national total average.

InΒ [29]:
risk_groups = ['Less than $15,000', '$15,000 - $24,999', '$25,000 - $34,999', '$35,000 - $49,999', 'Non-Hispanic Black', 'Hispanic']

q = 'Percent of adults aged 18 years and older who have obesity'
for grp in risk_groups:
    graph_questions(reg_avgs, q, grp, 'National')
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 4

To really see the difference between the at risk and not at risk groups mentioned before, we will graph their obesity trends on the same graph. To do this, we will create a new column in our dataframe that classifies whether a responder falls in an at risk group or not at risk group.

Upon graphing the two trends, we see a large difference in obeisty rates between the two groups. The at risk group has as much as 5% higher obesity rates than the non at risk groups, showing a big difference between the two.

InΒ [30]:
#adding the risk column by checking our risk_groups list from before
reg_avgs['Risk'] = reg_avgs['Stratification1'].apply(lambda r: 'At Risk' if r in risk_groups else 'Not At Risk')

plt.figure(figsize=(10,10))
x = reg_avgs['Year'].unique()
#finding the average rates among the risk and non risk groups
yrisk = reg_avgs[reg_avgs['Risk'] == 'At Risk'].groupby('Year')['Data_Value'].mean().reset_index()['Data_Value']
ynot = reg_avgs[reg_avgs['Risk'] == 'Not At Risk'].groupby('Year')['Data_Value'].mean().reset_index()['Data_Value']

plt.plot(x, yrisk, label='Risk Group')
plt.plot(x, ynot, label='Not At Risk Group')
plt.title('Average National Obesity Rates of Risk Group Compared to Not at Risk Group')
plt.xlabel('Year')
plt.ylabel('Percent')
plt.legend()
plt.show()
No description has been provided for this image

Step 5

To furhter our analysis, we are going to merge our two dataframes together. This will help allow us to see if relationships between food prices and obesity and inactivity rates exist. We will accomplish this by a simple merge function of the data, and for our first analyses, we will look at National obesity rates compares to the indexed price of food. We will do this by creating scatter plots of obesity rates over the index prices for the groups we choose. We will follow before and look at differences between groups at risk and groups not at risk. We will add lines of best fit to see if linear trends exist.

The method of graphing will be using seaborn's regplot function, which will create a scatter plot between the x (Price Index) and y (Obesity Rates) variables, as well as adding lines of best fits to each graph, as well as a confidence interval of possible ranges of the lines.

InΒ [31]:
merged = pd.merge(reg_avgs, cpi_index, on='Year', how='inner')
merged.head()
Out[31]:
Year Question Stratification1 Region Data_Value Risk Consumer Price Index item Percent change Price Index
0 2011 Percent of adults aged 18 years and older who ... $15,000 - $24,999 Midwest 34.018182 At Risk All food 3.7 103.7
1 2011 Percent of adults aged 18 years and older who ... $15,000 - $24,999 Midwest 34.018182 At Risk Beef and veal 10.2 110.2
2 2011 Percent of adults aged 18 years and older who ... $15,000 - $24,999 Midwest 34.018182 At Risk Cereals and bakery products 3.9 103.9
3 2011 Percent of adults aged 18 years and older who ... $15,000 - $24,999 Midwest 34.018182 At Risk Dairy products 6.8 106.8
4 2011 Percent of adults aged 18 years and older who ... $15,000 - $24,999 Midwest 34.018182 At Risk Eggs 9.2 109.2
InΒ [32]:
#filter to see only national and and all food prices
tot_and_nat = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
    'Data_Value': 'mean'
}).reset_index()

plt.figure(figsize=(10, 10))
xrisk = tot_and_nat[tot_and_nat['Risk'] == 'At Risk']['Price Index']
yrisk = tot_and_nat[tot_and_nat['Risk'] == 'At Risk']['Data_Value']
xnot = tot_and_nat[tot_and_nat['Risk'] == 'Not At Risk']['Price Index']
ynot = tot_and_nat[tot_and_nat['Risk'] == 'Not At Risk']['Data_Value']
sns.regplot(x=xrisk, y=yrisk, label='At Risk')
sns.regplot(x=xnot, y=ynot, label='Not At Risk')

plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of Food')
plt.xlabel('Indexed Price of Food')
plt.ylabel('Obesity Percentage')
plt.legend()
plt.show()
No description has been provided for this image

Step 6

Looking at the data of all food, we do see trends of higher indexed food prices with obesity rates, with the risk group having higher rates than the non-risk group.

To continue our analysis, we will look at the following foods of interest and create similar graphs: Meat, Fats and Oils, Fruits and Vegetables, and Sugar and Sweets.

Looking at the graphs, save for the Fats and Oils graph, which does not seem appropriate to graph, as the food price did not change as much as the others, we see similar trends. Higher food prices led to higher obesity rates, with fats and oils have the smallest trend of food groups.

InΒ [35]:
interest_food = ['Meats', 'Fats and oils', 'Fruits and vegetables', 'Sugar and sweets']
for food in interest_food:
    df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()

    plt.figure(figsize=(10, 10))
    xrisk = df[tot_and_nat['Risk'] == 'At Risk']['Price Index']
    yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
    xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
    ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
    sns.regplot(x=xrisk, y=yrisk, label='At Risk')
    sns.regplot(x=xnot, y=ynot, label='Not At Risk')

    plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
    plt.xlabel('Indexed Price of ' + food)
    plt.ylabel('Obesity Percentage')
    plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 7

Our next step to see if annual percent changes in food had any impact on obesity rates. To do this we will follow a very similar process, simply replacing our x axis with the percent change in food price, rather than the price index.

Upon graphing, we see interesting trends. We see that in years with higher percent increases, obesity rates are somewhat higher. These graphs are not as impactful as previously for the indexed price. The issue for this may be the latent affects of annual percent changes in foods. If food prices change in a year, it may take longer for individuals to change their BMI classification than in a year, meaning a percent change in one year may have more of an impact in the next year rather than the current one.

InΒ [36]:
for food in interest_food:
    df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    xrisk = df[df['Risk'] == 'At Risk']['Percent change']
    yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
    xnot = df[df['Risk'] == 'Not At Risk']['Percent change']
    ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
    sns.regplot(x=xrisk, y=yrisk, label='At Risk')
    sns.regplot(x=xnot, y=ynot, label='Not At Risk')

    plt.title('National Obesity Rates of At Risk and Not At Risk Groups Compared to Percent Change in Price of ' + food)
    plt.xlabel('Percent Change in Price of ' + food)
    plt.ylabel('Obesity Percentage')
    plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 8

Our next step will be to reproduce these food trends for physical activity rates. It could be thought that if food is more expensive in a year, individuals will change their physical activity habits to account for either an increase of ease in purchasing food, or a restriction in access to food.

We follow a similar metohd as before, using seaborn's regplot function, our only difference is that we replace the question we filter by to the physical activity question.

On looking at the graphs, we see a different trend. A difference is created between at-risk and non-risk groups in their response. As prices increase, at-risk groups tend to increase their physical activity, while non-risk groups decrease their physical activity

InΒ [37]:
interest_food.append('All food')
for food in interest_food:
    df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    xrisk = df[df['Risk'] == 'At Risk']['Price Index']
    yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
    xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
    ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
    sns.regplot(x=xrisk, y=yrisk, label='At Risk')
    sns.regplot(x=xnot, y=ynot, label='Not At Risk')

    plt.title('National Rate of Adults Engaging in no Leisure-Time Physical Acticity for At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
    plt.xlabel('Indexed Price of ' + food)
    plt.ylabel('Percent of Inactive Adults')
    plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 9

We are now going to continue our visual analysis to regional variation. We want to see if different regions of the U.S. have different obesity rate and inactivity responses to food prices.

We are going to follow a very similar method as before. The difference now is that we replace the region we filter by iterating through the regions list.

After looking at the graphs produced for obesity, we see geographical differences do exist. Our main variation exists between the West and Southeast regions. We see that the Western region has lower obesity response rates to food price, while the southeast region has much higher obesity response rates.

For the inactivity graphs, we also see similar geographical divides. The difference between risk and non-risk groups exist as seen before from food groups, but we see that the divide is much more pronounced in the Southeast and Northeast regions, while the West has less of a pronounced effect.

InΒ [39]:
for region in set(regions.values()):
    df = merged[(merged['Region'] == region) & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    xrisk = df[df['Risk'] == 'At Risk']['Price Index']
    yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
    xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
    ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
    sns.regplot(x=xrisk, y=yrisk, label='At Risk')
    sns.regplot(x=xnot, y=ynot, label='Not At Risk')

    plt.title(region + ' Obesity Rates of At Risk and Not At Risk Groups Compared to Indexed Price of Food')
    plt.xlabel('Indexed Price of Food')
    plt.ylabel('Obesity Percentage')
    plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
InΒ [40]:
for region in set(regions.values()):
    df = merged[(merged['Region'] == region) & (merged['Consumer Price Index item'] == 'All food') & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity')].groupby(['Year', 'Risk', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    xrisk = df[df['Risk'] == 'At Risk']['Price Index']
    yrisk = df[df['Risk'] == 'At Risk']['Data_Value']
    xnot = df[df['Risk'] == 'Not At Risk']['Price Index']
    ynot = df[df['Risk'] == 'Not At Risk']['Data_Value']
    sns.regplot(x=xrisk, y=yrisk, label='At Risk')
    sns.regplot(x=xnot, y=ynot, label='Not At Risk')

    plt.title(region + ' Rate of Adults Engaging in no Leisure-Time Physical Acticity for At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
    plt.xlabel('Indexed Price of ' + food)
    plt.ylabel('Percent of Inactive Adults')
    plt.legend()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 10

We will now see the trends specifically for the at risk groups mentioned before. To do this we will continue with our filtered df, but we will only graph one set of points of the percent response rate over the price index of food.

When looking at the graphs produced by the obesity question, we see all groups have a similar pronounced increase in obesity rates when compared to the price index of food increasing. This is more prononuced than the national graph produced way back in the page.

When looking at the graphs produced by the inactivity question, we see much different trends. For income, it is noticable that as price indices of food increased, the rate of increase in inactivity rates seemed to decrease, suggesting that as income increased, activity rates decreased when food prices went up. For Black and Hispanic adults, we see a new trend. We see that Black adults become much more inactive as price increases, while Hispanic adults only barely become more inactive as price increases

InΒ [41]:
for group in risk_groups:
    df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults aged 18 years and older who have obesity') & (merged['Stratification1'] == group)].groupby(['Year', 'Stratification1', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    x = df['Price Index']
    y = df['Data_Value']
    sns.regplot(x=x, y=y)

    plt.title('National Rate of Adults (' + group + ') Engaging in no Leisure-Time Physical Acticity for At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)

    plt.title('National Obesity Rates of Adults (' + group + ') Compared to Indexed Price of food')
    plt.xlabel('Indexed Price of food')
    plt.ylabel('Obesity Percentage')
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
InΒ [42]:
for group in risk_groups:
    df = merged[(merged['Region'] == 'National') & (merged['Consumer Price Index item'] == food) & (merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity') & (merged['Stratification1'] == group)].groupby(['Year', 'Stratification1', 'Percent change', 'Price Index']).agg({
        'Data_Value': 'mean'
    }).reset_index()
    plt.figure(figsize=(10, 10))
    x = df['Price Index']
    y = df['Data_Value']
    sns.regplot(x=x, y=y)

    plt.title('National Rate of Adults (' + group + ') Engaging in no Leisure-Time Physical Acticity for At Risk and Not At Risk Groups Compared to Indexed Price of ' + food)
    plt.xlabel('Indexed Price of Food')
    plt.ylabel('Percent of Inactive Adults')
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Data AnalysisΒΆ

We will now begin to work on more advanced methods on analyzing our data. We will begin by looking at a linear regression model, to see if our socioeconomic and food price factors have linear relationships with obesity and inactivity rates of adults.

Our goal is to see if we can find statistically significant factors that put individuals at risk for higher obesity rates.

Step 1

To begin, we are going to reduce our dataframe to our columns of interest in the model. We are then going to split the merged dataframe into two, one including only obesity response rates, and the other inactivity response rates.

We will begin with the obesity linear model. To begin we will look at every variable and see the total effect of each variable: year, strat groups, risk group, region, price index of food, and percent change.

Looking at the results of this first model, we see some glaring issues. Some coefficients are extreme, and our condition number is enormous. This means that the model is not doing a simple unbiased linear regression, there exists large amounts of covariance and multicollinearity. This means that our independent variables are affecting each other before impacting the actual percent of obesity. We will need to change our model

InΒ [43]:
merged.columns = ['Year', 'Question', 'Strat', 'Region', 'Data_Value', 'Risk', 'CPI_item', 'Percent_Change', 'Price_Index']
obese = merged[merged['Question'] == 'Percent of adults aged 18 years and older who have obesity']
inactive = merged[merged['Question'] == 'Percent of adults who engage in no leisure-time physical activity']

linmodel = sm.ols(formula='Data_Value ~ Year + Strat + Risk + Region + Price_Index + Percent_Change', data=obese).fit()
print(linmodel.summary2())
                                  Results: Ordinary least squares
====================================================================================================
Model:                          OLS                        Adj. R-squared:               0.901      
Dependent Variable:             Data_Value                 AIC:                          179850.4568
Date:                           2024-05-18 18:00           BIC:                          180152.7134
No. Observations:               41602                      Log-Likelihood:               -89890.    
Df Model:                       34                         F-statistic:                  1.118e+04  
Df Residuals:                   41567                      Prob (F-statistic):           0.00       
R-squared:                      0.901                      Scale:                        4.4123     
----------------------------------------------------------------------------------------------------
                                            Coef.    Std.Err.     t     P>|t|    [0.025     0.975]  
----------------------------------------------------------------------------------------------------
Intercept                                 -1179.6479   7.4115 -159.1644 0.0000 -1194.1746 -1165.1211
Strat[T.$25,000 - $34,999]                   -1.6703   0.0746  -22.3780 0.0000    -1.8166    -1.5240
Strat[T.$35,000 - $49,999]                   -2.1223   0.0746  -28.4344 0.0000    -2.2686    -1.9760
Strat[T.$50,000 - $74,999]                    1.4619   0.0517   28.2943 0.0000     1.3606     1.5632
Strat[T.$75,000 or greater]                  -2.4163   0.0517  -46.7654 0.0000    -2.5176    -2.3150
Strat[T.18 - 24]                            -12.1254   0.0517 -234.6788 0.0000   -12.2267   -12.0241
Strat[T.2 or more races]                      1.7661   0.0517   34.1813 0.0000     1.6648     1.8674
Strat[T.25 - 34]                             -1.0108   0.0517  -19.5624 0.0000    -1.1120    -0.9095
Strat[T.35 - 44]                              3.8987   0.0517   75.4569 0.0000     3.7974     4.0000
Strat[T.45 - 54]                              5.9858   0.0517  115.8514 0.0000     5.8846     6.0871
Strat[T.55 - 64]                              4.8285   0.0517   93.4516 0.0000     4.7272     4.9297
Strat[T.65 or older]                         -2.2341   0.0517  -43.2387 0.0000    -2.3353    -2.1328
Strat[T.American Indian/Alaska Native]        6.6008   0.0517  127.7531 0.0000     6.4995     6.7020
Strat[T.Asian]                              -19.2597   0.0517 -372.7568 0.0000   -19.3609   -19.1584
Strat[T.College graduate]                    -5.8086   0.0517 -112.4204 0.0000    -5.9098    -5.7073
Strat[T.Female]                               0.2193   0.0517    4.2438 0.0000     0.1180     0.3205
Strat[T.Hawaiian/Pacific Islander]            9.1824   0.0838  109.5361 0.0000     9.0181     9.3467
Strat[T.High school graduate]                 2.3674   0.0517   45.8202 0.0000     2.2662     2.4687
Strat[T.Hispanic]                            -1.4084   0.0746  -18.8686 0.0000    -1.5546    -1.2621
Strat[T.Less than $15,000]                    1.3337   0.0746   17.8690 0.0000     1.1874     1.4800
Strat[T.Less than high school]                4.4957   0.0517   87.0104 0.0000     4.3944     4.5969
Strat[T.Male]                                 0.1802   0.0517    3.4885 0.0005     0.0790     0.2815
Strat[T.Non-Hispanic Black]                   3.6842   0.0746   49.3590 0.0000     3.5379     3.8305
Strat[T.Non-Hispanic White]                  -1.3819   0.0517  -26.7458 0.0000    -1.4832    -1.2806
Strat[T.Other]                               -2.9152   0.0547  -53.3133 0.0000    -3.0224    -2.8081
Strat[T.Some college or technical school]     1.9197   0.0517   37.1550 0.0000     1.8185     2.0210
Strat[T.Total]                                0.2071   0.0517    4.0088 0.0001     0.1059     0.3084
Risk[T.Not At Risk]                          -4.0382   0.0517  -78.1566 0.0000    -4.1395    -3.9369
Region[T.National]                           -2.2397   0.0358  -62.5842 0.0000    -2.3099    -2.1696
Region[T.Northeast]                          -3.7239   0.0359 -103.6718 0.0000    -3.7943    -3.6535
Region[T.Southeast]                           0.1112   0.0359    3.0926 0.0020     0.0407     0.1816
Region[T.Southwest]                          -0.8441   0.0360  -23.4656 0.0000    -0.9146    -0.7736
Region[T.West]                               -4.4237   0.0358 -123.6098 0.0000    -4.4938    -4.3535
Year                                          0.6030   0.0037  162.2880 0.0000     0.5957     0.6103
Price_Index                                  -0.0023   0.0013   -1.8333 0.0668    -0.0049     0.0002
Percent_Change                                0.0202   0.0027    7.4551 0.0000     0.0149     0.0256
----------------------------------------------------------------------------------------------------
Omnibus:                      9351.181              Durbin-Watson:                 0.097            
Prob(Omnibus):                0.000                 Jarque-Bera (JB):              456937.672       
Skew:                         0.128                 Prob(JB):                      0.000            
Kurtosis:                     19.234                Condition No.:                 32869954963714204
====================================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.57e-22. This might indicate that                there are strong
multicollinearity problems or that the design                matrix is singular.

Step 2

We will look to see the covariance between our numerical values. We see that year has a high covariance with every other value, meaning it could be affecting things other than obesity rate. We also see a high covariance between prince_index and percent_change, which is to be expected, they are formed from each other.

To fix our model, we are going to remove year from our equation, as well as separating price index and percent change.

When looking at the new model with only price index, our model becomes much more realistic, with the condition number coming down to 3156. That is still not ideal, but it is workable. We see almost every variable is significant, save for the 35-44 group, and the Southeast region. We see that the Asian, and 18-24 groups are strongly impacting the obesity rate, having the largest decrease in rates.

When looking at the model with percent change, it becomes even more manageable, with a condition number of 145. This model follows similar trends to the previous one with price index, but it suggests that Percent Change is less statistically significant than Price Index, with a lower t value.

InΒ [45]:
obese[['Year', 'Data_Value', 'Percent_Change', 'Price_Index']].corr()
Out[45]:
Year Data_Value Percent_Change Price_Index
Year 1.000000 0.313128 0.218156 0.582468
Data_Value 0.313128 1.000000 0.081219 0.186722
Percent_Change 0.218156 0.081219 1.000000 0.538754
Price_Index 0.582468 0.186722 0.538754 1.000000
InΒ [46]:
olinmodel = sm.ols(formula='Data_Value ~ Strat + Region + Price_Index', data=obese).fit()
print(olinmodel.summary2())
                               Results: Ordinary least squares
==============================================================================================
Model:                        OLS                      Adj. R-squared:             0.838      
Dependent Variable:           Data_Value               AIC:                        200409.3330
Date:                         2024-05-18 18:07         BIC:                        200694.3178
No. Observations:             41602                    Log-Likelihood:             -1.0017e+05
Df Model:                     32                       F-statistic:                6737.      
Df Residuals:                 41569                    Prob (F-statistic):         0.00       
R-squared:                    0.838                    Scale:                      7.2328     
----------------------------------------------------------------------------------------------
                                           Coef.   Std.Err.     t     P>|t|   [0.025   0.975] 
----------------------------------------------------------------------------------------------
Intercept                                  23.8564   0.1508  158.1780 0.0000  23.5607  24.1520
Strat[T.$25,000 - $34,999]                 -1.6703   0.0956  -17.4784 0.0000  -1.8576  -1.4830
Strat[T.$35,000 - $49,999]                 -2.1223   0.0956  -22.2087 0.0000  -2.3096  -1.9350
Strat[T.$50,000 - $74,999]                 -2.5763   0.0956  -26.9590 0.0000  -2.7636  -2.3890
Strat[T.$75,000 or greater]                -6.4545   0.0956  -67.5414 0.0000  -6.6418  -6.2672
Strat[T.18 - 24]                          -16.1636   0.0956 -169.1404 0.0000 -16.3509 -15.9763
Strat[T.2 or more races]                   -2.2721   0.0956  -23.7761 0.0000  -2.4594  -2.0848
Strat[T.25 - 34]                           -5.0490   0.0956  -52.8336 0.0000  -5.2363  -4.8617
Strat[T.35 - 44]                           -0.1395   0.0956   -1.4596 0.1444  -0.3268   0.0478
Strat[T.45 - 54]                            1.9476   0.0956   20.3804 0.0000   1.7603   2.1349
Strat[T.55 - 64]                            0.7903   0.0956    8.2695 0.0000   0.6030   0.9776
Strat[T.65 or older]                       -6.2723   0.0956  -65.6346 0.0000  -6.4596  -6.0850
Strat[T.American Indian/Alaska Native]      2.5626   0.0956   26.8153 0.0000   2.3753   2.7499
Strat[T.Asian]                            -23.2979   0.0956 -243.7948 0.0000 -23.4852 -23.1106
Strat[T.College graduate]                  -9.8468   0.0956 -103.0391 0.0000 -10.0341  -9.6595
Strat[T.Female]                            -3.8189   0.0956  -39.9624 0.0000  -4.0062  -3.6316
Strat[T.Hawaiian/Pacific Islander]          5.1237   0.1303   39.3077 0.0000   4.8682   5.3791
Strat[T.High school graduate]              -1.6708   0.0956  -17.4833 0.0000  -1.8581  -1.4835
Strat[T.Hispanic]                          -1.4084   0.0956  -14.7374 0.0000  -1.5957  -1.2210
Strat[T.Less than $15,000]                  1.3337   0.0956   13.9566 0.0000   1.1464   1.5210
Strat[T.Less than high school]              0.4575   0.0956    4.7869 0.0000   0.2702   0.6448
Strat[T.Male]                              -3.8580   0.0956  -40.3707 0.0000  -4.0453  -3.6707
Strat[T.Non-Hispanic Black]                 3.6842   0.0956   38.5520 0.0000   3.4969   3.8715
Strat[T.Non-Hispanic White]                -5.4201   0.0956  -56.7175 0.0000  -5.6074  -5.2328
Strat[T.Other]                             -7.0591   0.0985  -71.6405 0.0000  -7.2523  -6.8660
Strat[T.Some college or technical school]  -2.1185   0.0956  -22.1683 0.0000  -2.3058  -1.9312
Strat[T.Total]                             -3.8311   0.0956  -40.0894 0.0000  -4.0184  -3.6438
Region[T.National]                         -2.2368   0.0458  -48.8181 0.0000  -2.3266  -2.1470
Region[T.Northeast]                        -3.7186   0.0460  -80.8570 0.0000  -3.8087  -3.6285
Region[T.Southeast]                         0.1086   0.0460    2.3603 0.0183   0.0184   0.1988
Region[T.Southwest]                        -0.8420   0.0461  -18.2811 0.0000  -0.9323  -0.7517
Region[T.West]                             -4.4208   0.0458  -96.4823 0.0000  -4.5106  -4.3309
Price_Index                                 0.1063   0.0011   93.3431 0.0000   0.1041   0.1086
----------------------------------------------------------------------------------------------
Omnibus:                      4811.513               Durbin-Watson:                  0.269    
Prob(Omnibus):                0.000                  Jarque-Bera (JB):               49004.552
Skew:                         0.070                  Prob(JB):                       0.000    
Kurtosis:                     8.315                  Condition No.:                  3156     
==============================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.16e+03. This might indicate                that there are
strong multicollinearity or other numerical                problems.
InΒ [47]:
olinmodel = sm.ols(formula='Data_Value ~ Strat + Region + Percent_Change', data=obese).fit()
print(olinmodel.summary2())
                               Results: Ordinary least squares
==============================================================================================
Model:                        OLS                      Adj. R-squared:             0.811      
Dependent Variable:           Data_Value               AIC:                        206972.1435
Date:                         2024-05-18 18:07         BIC:                        207257.1283
No. Observations:             41602                    Log-Likelihood:             -1.0345e+05
Df Model:                     32                       F-statistic:                5564.      
Df Residuals:                 41569                    Prob (F-statistic):         0.00       
R-squared:                    0.811                    Scale:                      8.4687     
----------------------------------------------------------------------------------------------
                                           Coef.   Std.Err.     t     P>|t|   [0.025   0.975] 
----------------------------------------------------------------------------------------------
Intercept                                  35.8061   0.0804  445.4667 0.0000  35.6486  35.9637
Strat[T.$25,000 - $34,999]                 -1.6703   0.1034  -16.1527 0.0000  -1.8730  -1.4676
Strat[T.$35,000 - $49,999]                 -2.1223   0.1034  -20.5243 0.0000  -2.3250  -1.9197
Strat[T.$50,000 - $74,999]                 -2.5763   0.1034  -24.9143 0.0000  -2.7790  -2.3736
Strat[T.$75,000 or greater]                -6.4545   0.1034  -62.4187 0.0000  -6.6572  -6.2518
Strat[T.18 - 24]                          -16.1636   0.1034 -156.3118 0.0000 -16.3663 -15.9609
Strat[T.2 or more races]                   -2.2721   0.1034  -21.9728 0.0000  -2.4748  -2.0694
Strat[T.25 - 34]                           -5.0490   0.1034  -48.8264 0.0000  -5.2516  -4.8463
Strat[T.35 - 44]                           -0.1395   0.1034   -1.3489 0.1774  -0.3422   0.0632
Strat[T.45 - 54]                            1.9476   0.1034   18.8346 0.0000   1.7449   2.1503
Strat[T.55 - 64]                            0.7903   0.1034    7.6423 0.0000   0.5876   0.9929
Strat[T.65 or older]                       -6.2723   0.1034  -60.6565 0.0000  -6.4749  -6.0696
Strat[T.American Indian/Alaska Native]      2.5626   0.1034   24.7815 0.0000   2.3599   2.7652
Strat[T.Asian]                            -23.2979   0.1034 -225.3040 0.0000 -23.5005 -23.0952
Strat[T.College graduate]                  -9.8468   0.1034  -95.2240 0.0000 -10.0494  -9.6441
Strat[T.Female]                            -3.8189   0.1034  -36.9314 0.0000  -4.0216  -3.6163
Strat[T.Hawaiian/Pacific Islander]          5.1153   0.1410   36.2664 0.0000   4.8388   5.3917
Strat[T.High school graduate]              -1.6708   0.1034  -16.1573 0.0000  -1.8734  -1.4681
Strat[T.Hispanic]                          -1.4084   0.1034  -13.6196 0.0000  -1.6110  -1.2057
Strat[T.Less than $15,000]                  1.3337   0.1034   12.8980 0.0000   1.1311   1.5364
Strat[T.Less than high school]              0.4575   0.1034    4.4239 0.0000   0.2548   0.6601
Strat[T.Male]                              -3.8580   0.1034  -37.3088 0.0000  -4.0606  -3.6553
Strat[T.Non-Hispanic Black]                 3.6842   0.1034   35.6280 0.0000   3.4815   3.8868
Strat[T.Non-Hispanic White]                -5.4201   0.1034  -52.4157 0.0000  -5.6228  -5.2174
Strat[T.Other]                             -7.1936   0.1066  -67.4722 0.0000  -7.4025  -6.9846
Strat[T.Some college or technical school]  -2.1185   0.1034  -20.4869 0.0000  -2.3212  -1.9158
Strat[T.Total]                             -3.8311   0.1034  -37.0488 0.0000  -4.0338  -3.6284
Region[T.National]                         -2.2346   0.0496  -45.0717 0.0000  -2.3318  -2.1375
Region[T.Northeast]                        -3.7158   0.0498  -74.6681 0.0000  -3.8133  -3.6183
Region[T.Southeast]                         0.1090   0.0498    2.1885 0.0286   0.0114   0.2066
Region[T.Southwest]                        -0.8401   0.0498  -16.8570 0.0000  -0.9378  -0.7424
Region[T.West]                             -4.4186   0.0496  -89.1207 0.0000  -4.5158  -4.3214
Percent_Change                              0.1164   0.0031   37.0790 0.0000   0.1102   0.1225
----------------------------------------------------------------------------------------------
Omnibus:                      3474.612               Durbin-Watson:                  0.080    
Prob(Omnibus):                0.000                  Jarque-Bera (JB):               22515.428
Skew:                         -0.004                 Prob(JB):                       0.000    
Kurtosis:                     6.604                  Condition No.:                  145      
==============================================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Step 3

We want to see a visual representation of the reliability of the model. To do this we plot the residuals (error terms in predicted and actual values of the model) and the fitted values of the model. We see a very large range of residuals, and a funnel shape in the graph, indicating that this model is missing another data that impacts obesity rates.

Looking at all of this, we can assume that this data does not follow a linear relationship. We are not going to bother with the inactivity rates with a linear model, as it shares the same independent values and the errors that come with them. We are going to move to a different type of model

InΒ [48]:
plt.figure(figsize=(8,6))
plt.scatter(olinmodel.fittedvalues, olinmodel.resid)
plt.axhline(y=0, color='orange', linestyle='-')
plt.title('Obesity Linear Model Fitted Values over Residuals')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
No description has been provided for this image

Step 4

We are going to use a random forest to model this data. Random forests are helpful in that they can help deal with models that are not linear in nature. Random forests use a set of decision trees to help predict values based on independent values. It will help with the classification nature of our data, between strata and region. The random forest will also help show which variables are most influential in affecting obesity rates.

To do this, we will use sklearns and statsmodels frameworks. Our first step will be to one hot encode our data. This means that we will replace our categorical data into arrays that put a 1 for the category they match, and 0 elsewhere. This will create a new column for each stratification data. We will then scale our date to use the same unit, removing unit confusion in our data. We will separate our data into a training and testing set, to see a better representation of data in the set.

To best see if our random forest is accurate, we will take the mean squared error of its results, which compare the difference between its predicted obesity/inactivity rates compared to the actual data for each row, as well as listing out and graphing our feature importance, to see if some are over powering others, as well as graphingour residuals and predicted vs actual data.

Upon our first random forest for the obesity set, we see a relatively small mean squared error of 3, meaning we are on average off by about 3% in percentages (which is less than 10% of average values). We also see we have two very strong features: Asian, and 18-24. We saw this in the linear model as well. These two values are outliers, and may be negatively impacting our data, especially as these groups have the lowest obesity rates in the country. We will then filter these two out for another run of our random forest model.

We also see that our predicted values follow the actual values at a decent rate in our second graph. Our goal is for our data to be as close to the orange line as possible, and it is much close than the linear model.

InΒ [50]:
# a function to allow for easy repeatability
def rf_analysis(df, data):
    #scale our data and create a list of our features, including the new dummy ones
    scaler = StandardScaler()
    feats = ['Percent_Change', 'Price_Index'] + [c for c in df.columns if c.startswith('Strat') or c.startswith('Region') or c.startswith('Risk')]
    x = scaler.fit_transform(df[feats])
    y = df['Data_Value'].values
    #split our data, 90% training and 10% to test
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.1, random_state=30)

    #create our forest and calc our MSE
    forest = RandomForestRegressor(n_estimators=100, random_state=30)
    forest.fit(x_train, y_train)
    y_pred = forest.predict(x_test)
    mse = mean_squared_error(y_pred, y_test)
    print(f"Mean Square Error of {data} Random Forest Model: {mse}")
    #list out the importance for each feature
    forest_list = forest.feature_importances_
    forest_df = pd.DataFrame({'Feature': feats, 'Importance': forest_list})
    print(forest_df)

    #graph our importance
    plt.figure(figsize=(8,6))
    plt.barh(forest_df['Feature'], forest_df['Importance'])
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.title('Feature Importance in the ' + data + ' Random Forest')
    plt.show()

    #graph the actual over predicted values, as well as a line y = x
    plt.figure(figsize=(8,6))
    plt.scatter(y_test, y_pred)
    plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='orange')
    plt.title('Actual Over Predicted Values of the ' + data + ' Random Forest')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.show()

    #graph our residuals over actual values
    plt.figure(figsize=(8,6))
    forest_resid = y_test - y_pred
    plt.scatter(y_pred, forest_resid)
    plt.axhline(y=0, color='orange', linestyle='-')
    plt.title('Residuals Over Predicted Values of the ' + data + ' Random Forest')
    plt.xlabel('Actual Values')
    plt.ylabel('Residuals')
    plt.show()

odummies = pd.get_dummies(obese, columns=['Risk', 'Strat', 'Region'], drop_first=True)
rf_analysis(odummies, 'Obesity')
Mean Square Error of Obesity Random Forest Model: 3.302869382302935
                                   Feature  Importance
0                           Percent_Change    0.051929
1                              Price_Index    0.083600
2                         Risk_Not At Risk    0.009302
3                  Strat_$25,000 - $34,999    0.001307
4                  Strat_$35,000 - $49,999    0.001561
5                  Strat_$50,000 - $74,999    0.001267
6                 Strat_$75,000 or greater    0.015355
7                            Strat_18 - 24    0.159959
8                    Strat_2 or more races    0.003125
9                            Strat_25 - 34    0.006206
10                           Strat_35 - 44    0.005375
11                           Strat_45 - 54    0.014326
12                           Strat_55 - 64    0.007546
13                       Strat_65 or older    0.014492
14     Strat_American Indian/Alaska Native    0.021585
15                             Strat_Asian    0.343933
16                  Strat_College graduate    0.052354
17                            Strat_Female    0.001844
18         Strat_Hawaiian/Pacific Islander    0.036879
19              Strat_High school graduate    0.002026
20                          Strat_Hispanic    0.001652
21                 Strat_Less than $15,000    0.007632
22             Strat_Less than high school    0.007384
23                              Strat_Male    0.002016
24                Strat_Non-Hispanic Black    0.026632
25                Strat_Non-Hispanic White    0.009292
26                             Strat_Other    0.020001
27  Strat_Some college or technical school    0.001469
28                             Strat_Total    0.001760
29                         Region_National    0.011045
30                        Region_Northeast    0.026337
31                        Region_Southeast    0.013147
32                        Region_Southwest    0.008375
33                             Region_West    0.029284
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 5

We first get rid of our outlier groups, and then run our random forest model again to get a better picture of impactful variables.

We find that our two most impactful are actually the price index and percent change in food prices. We even see this with a similarly small MSE of 3.45, which indicates the model is pretty accurate. Our graphs also show that the predicted values of our model do not stray too far from the actual values as well. We are able to now see our most impactful variables for obesity rates for adults.

These are the food prices, as well as several other strata groups mentioned in the risk groups way back in the document.

InΒ [51]:
obese_outliers_gone = obese[~((obese['Strat'] == 'Asian') | (obese['Strat'] == '18 - 24'))]
oo_dummies = pd.get_dummies(obese_outliers_gone, columns=['Strat', 'Region', 'Risk'], drop_first=True)
rf_analysis(oo_dummies, 'Obesity')
Mean Square Error of Obesity Random Forest Model: 3.4521238743072393
                                   Feature  Importance
0                           Percent_Change    0.100906
1                              Price_Index    0.162334
2                  Strat_$25,000 - $34,999    0.002816
3                  Strat_$35,000 - $49,999    0.003403
4                  Strat_$50,000 - $74,999    0.002643
5                 Strat_$75,000 or greater    0.030941
6                    Strat_2 or more races    0.006292
7                            Strat_25 - 34    0.012959
8                            Strat_35 - 44    0.010933
9                            Strat_45 - 54    0.030708
10                           Strat_55 - 64    0.015340
11                       Strat_65 or older    0.030186
12     Strat_American Indian/Alaska Native    0.043297
13                  Strat_College graduate    0.108267
14                            Strat_Female    0.003861
15         Strat_Hawaiian/Pacific Islander    0.076447
16              Strat_High school graduate    0.004224
17                          Strat_Hispanic    0.003336
18                 Strat_Less than $15,000    0.014805
19             Strat_Less than high school    0.014992
20                              Strat_Male    0.004255
21                Strat_Non-Hispanic Black    0.053239
22                Strat_Non-Hispanic White    0.018548
23                             Strat_Other    0.040477
24  Strat_Some college or technical school    0.003012
25                             Strat_Total    0.003681
26                         Region_National    0.021995
27                        Region_Northeast    0.053997
28                        Region_Southeast    0.024626
29                        Region_Southwest    0.014480
30                             Region_West    0.061191
31                        Risk_Not At Risk    0.021810
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Step 6

We will now repeat our analysis for our inactivity dataframe. We see much differnt importance values for inactivity rates. We see the biggest identifiers for inactivity levels are being in our listed risk group, having less than a high school degree, and being high income. We have seen some of these trends before, with higher income individuals decreasing their activity rates. This model is also pretty accurate, with an even smaller MSE of 2.61, with our graphs being more tighlty bound to the ideal rates as well.

InΒ [52]:
inactive_outliers_gone = inactive[~((inactive['Strat'] == 'Asian') | (inactive['Strat'] == '18 - 24'))]
io_dummies = pd.get_dummies(inactive_outliers_gone, columns=['Strat', 'Region', 'Risk'], drop_first=True)
rf_analysis(io_dummies, 'Inactivity')
Mean Square Error of Inactivity Random Forest Model: 2.612063666843684
                                   Feature  Importance
0                           Percent_Change    0.034283
1                              Price_Index    0.041326
2                  Strat_$25,000 - $34,999    0.008151
3                  Strat_$35,000 - $49,999    0.019487
4                  Strat_$50,000 - $74,999    0.003909
5                 Strat_$75,000 or greater    0.082610
6                    Strat_2 or more races    0.001310
7                            Strat_25 - 34    0.006513
8                            Strat_35 - 44    0.000389
9                            Strat_45 - 54    0.006509
10                           Strat_55 - 64    0.017036
11                       Strat_65 or older    0.044173
12     Strat_American Indian/Alaska Native    0.023937
13                  Strat_College graduate    0.091810
14                            Strat_Female    0.009938
15         Strat_Hawaiian/Pacific Islander    0.006280
16              Strat_High school graduate    0.032447
17                          Strat_Hispanic    0.004697
18                 Strat_Less than $15,000    0.051601
19             Strat_Less than high school    0.195045
20                              Strat_Male    0.000254
21                Strat_Non-Hispanic Black    0.004278
22                Strat_Non-Hispanic White    0.000520
23                             Strat_Other    0.004468
24  Strat_Some college or technical school    0.000377
25                             Strat_Total    0.004040
26                         Region_National    0.002896
27                        Region_Northeast    0.005725
28                        Region_Southeast    0.020076
29                        Region_Southwest    0.006600
30                             Region_West    0.064367
31                        Risk_Not At Risk    0.204947
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

InsightΒΆ

As a whole, the rates of obesity in American adults have been worrying as of late. They have increased over the past several years, and show no sign of stopping in the near future. With the mentioned health risks that come with obesity, actions to help reduce these rates would be highly beneficial to the American population.

Through our analysis, we have found that certain factors of an individual's status or environment can have large impacts on their chance to being obese, or being inactive. We have found that certain groups like Black adults, Hispanic adults, and low-income individuals are very prone to being obese. Analysis in this data has also shown that food price change has increased effects for these groups, meaning higher food prices could lead to higher obesity rates for these groups. Policy that can help reduce food prices in areas predominantly populated by these risk groups could be beneficial in reducing their obesity rates. Other geographical divides could also be mended. Seeing that the Western region has the lowest obesity rates in the country, other regions could try to follow certain policies that the West implements. This could help lower their obesity rates as well.

Looking at inactivity rates, other groups are at risk. Individuals with high income levels, and lower education levels could also be at risk for being more inactive. Considering that physical activity is a crucial element in reducing obesity rates, public policy could be implemented to help encourage more activity. This could include subsidizing gym memberships, increasing open spaces like parks in areas with low activity levels, or providing more free time to American adults.

With inflation of food prices seemingly unending, it is important for policy makers to find ways to limit the increase of food prices. This analysis has helped show that increases in the price of food can negatively impact obesity and inactivity rates, so it is an important thing to consider when looking at the health of American adults.